Abstract



An  approach  to  computer  recognition  of  continuous 
speech  through  phoneme  identification  is  presented.  Speech 
data is recorded on a tape recorder, digitally sampled,
Fast Fourier Transformed, and logarithmically compressed
into 16 frequency channels. This digitized data is first
processed by a crosscorrelation and then by a decision program.

After  the  phonemes  are  located,  a ranking  of  the  selections 
is  made.  This  procedure  was  used  on  both  the  discrete  and 
continuous  speech  of  five  different  speakers.  Phoneme 
averaging  was  used  to  calculate  a universal  set  of  eight 
prototype  phonemes.  For  the  word  groups  analyzed,  the 
final  identification  and  location  rates  were  83.5  and  95.9 
percent.  For  the  verification  sentences  analyzed  the  final 
identification  and  location  rates  were  77.9  and  91.1  percent 
for  discrete  speech  and  66.1  and  88.2  percent  for  continuous 
speech.  For  the  test  sentences  analyzed  the  final  identifi- 
cation  and  location  rates  were  62.3  and  77.9  for  discrete 
speech  and  48.6  and  66.1  percent  for  continuous  speech. 



COMPUTER  IDENTIFICATION 
OF  PHONEMES 
IN  CONTINUOUS  SPEECH 

I. Introduction

This  research  effort  is  a continuation  of  the  work 
initiated by Major Ralph W. Neyman and continued by
Captains William R. Hensley, Michael F. Guyote, and
Patrick L. Sisson on the problem of computer speech
recognition. The long term goal of this research is to achieve
the  recognition  of  unrestricted,  continuous  speech  by 
machine. 

In  various  situations,  such  as  the  highly  automated 
cockpit  of  today's  aircraft,  the  restriction  on  man's 
ability  to  communicate  to  a computer  or  machine  through  the 
use  of  conventional  input/output  peripherals  is  becoming 
increasingly  intolerable.  The  advantages  of  a spoken  word 
input  to  a computer  or  machine  have  been  recognized  and 
techniques  to  solve  this  problem  are  being  analyzed  by  many 
research  groups  throughout  the  world  (Ref  1:319).  Experi- 
ments comparing  speech  with  other  modes  of  communication, 
such  as  typing,  have  indicated  that  information  is  trans- 
ferred almost  twice  as  fast  with  speech  (Ref  19:2).  Thus, 
speech  input  will  help  optimize  the  man/machine  interface. 

Present literature expresses the opinion that a
continuous speech recognition system is still years in the
future, and even then, the system may be highly restrictive
(Ref 22:531). However, the encouraging results presented
in the Neyman, Hensley, Guyote, and Sisson theses seem to
contradict this belief and are the basis for this continued
research (Ref 11, 13, 19).

Motivation 

All  current  systems  of  voice  control  that  rely  on  the 
computer  recognition  of  human  speech  are  based  on  a highly 
constrained  manner  of  speaking.  A sample  of  the  more 
accurate  speech  recognition  systems  is  listed  in  Table  I 
(Ref  8,  18,  22:531).  These  speech  recognition  systems  per- 
form only  in  response  to  an  isolated-word  or  isolated- 
phrase  input.  Further,  the  general  constraints  that  must 
be  observed  when  using  these  machines  include  any  one  or 
combination  of  the  following: 

1.  All  commands  must  be  separated  by  a long  pause. 

2.  Vocabulary  words  are  limited  to  a class  size  of 
approximately  560. 

3.  Commands  must  be  spoken  in  a specified  word  order. 

4.  The  speech  recognition  system  must  be  programmed 
to  the  unique  speaking  characteristics  of  each 
user  who  must  be  very  consistent  in  his  speech. 

The ultimate speech recognition system is one that
would respond to a natural, unrestricted voice input. When
a person speaks, a complex acoustic signal is generated.
This signal is a function of the size and shape of the
individual's vocal cavity and movements of the tongue, lips,
and teeth. Also, the nature of the speech signal itself
changes with the individual's rate of speaking, emotional
state, and context of the utterance. Therefore, instead of
trying to recognize discrete words, of which there are
literally tens of thousands in the English language, this
research is concerned with identifying the fundamental
elements of words. These elements are called phonemes
and they are defined to be the smallest distinguishable units
of speech.

[Table I, surviving fragment: Texas Instruments, Doddington (1973), 10 digits, continuous speech]

Experiments  indicate  that  one-fourth  to  one-half  of 
the  words  in  normal  conversational  speech  are  unintelligible 
when  taken  out  of  context  and  heard  in  isolation  (Ref  31:41). 
This  implies  that  a system  for  understanding  continuous 
speech  must  use  context  related  rules  to  identify  the  words 
in  the  sentence.  A machine  dedicated  to  recognizing  iso- 
lated words  would  need  an  extremely  large  memory  capacity 
to  contain  all  the  words  in  the  English  language  in  addition 
to  the  related  context  programs.  However,  a more  versatile 
recognition  system  that  relies  upon  phoneme  detection  would 
only  require  storage  for  approximately  100  phonemes  and  the 
related  context  programs. 

Military  applications  for  a speech  recognition  system 
include  security,  command  and  control,  data  transmission 
and  communication,  and  the  processing  of  distorted  speech. 
Table  II  presents  a representative  listing  of  these  potential 
applications  (Ref  1:310). 

Objective 


The main objective of this research was to improve and
change as necessary the speech recognition scheme previously
developed by Neyman, et al. (Ref 11, 13, 19). Also, a method
was developed to identify phonemes from continuous speech so
that an average or universal set of prototype phonemes could
be calculated. This set of universal prototype phonemes was
then used to locate and identify similar phonemes in the
continuous speech of dissimilar speakers using pattern
matching and crosscorrelation techniques.


Table II

Military Tasks for Possible Automation

1) Security
   1.1 Speaker Verification (Authentication)
   1.2 Speaker Identification (Recognition)
   1.3 Determining emotional state of speaker (e.g.,
       stress effects)
   1.4 Recognition of spoken codes
   1.5 Secure access voice identification, whether or not
       in combination with fingerprints, facial information,
       identity card, signature, etc.
   1.6 Surveillance of communication channels

2) Command and Control
   2.1 System control (ships, aircraft, fire control,
       situation displays, etc.)
   2.2 Voice-operated computer input/output (each telephone
       a terminal)
   2.3 Data handling and record control
   2.4 Material handling (mail, baggage, publications,
       industrial applications)
   2.5 Remote control (dangerous material)
   2.6 Administrative record control

3) Data Transmission and Communication
   3.1 Speech synthesis
   3.2 Vocoder systems
   3.3 Bandwidth reduction or, more general, bit-rate
       reduction
   3.4 Ciphering/coding/scrambling

4) Processing Distorted Speech
   4.1 Diver speech
   4.2 Astronaut communication
   4.3 Underwater telephone
   4.4 Oxygen mask speech
   4.5 High "G" force speech

Although  analyzed  spectral  information  can  produce  some 
recognition,  it  cannot  do  the  entire  job.  To  quote  Flanagan: 

Automatic  speech  recognition — as  the  human 
accomplishes  it — will  probably  be  possible  only 
through  the  proper  analysis  and  application  of 
grammatical, contextual, and semantic constraints.
This approach also presumes an acoustic analysis
which preserves the same information that the
human transducer (i.e., the ear) does. It is
clear,  too,  that  for  a given  accuracy  of  recognition, 
a trade  can  be  made  between  the  necessary  linguistic 
constraints,  and  complexity  of  vocabulary,  and  the 
number  of  speakers  (Ref  9:163). 

In  recognition  of  the  above,  this  research  does  not 
include  linguistic  or  syntactic  recognition  schemes  since 
the  entire  set  of  phonemes  for  the  English  language  was 
not  developed.  However,  the  rank  ordering  of  the  identified 
phonemes  by  a decision  scheme  would  permit  the  use  of  a 
higher-order  linguistic/syntactic  program. 

Another  objective  was  to  identify  the  application  of 
current  microelectronic  devices  or  the  need  for  a special 
type  of  device  to  implement  this  speech  recognition  scheme. 


Scope 


The desired result of this research was to implement a
technique to develop universal phonemes and to locate and
identify these phonemes in discrete and continuous speech.
A set of eight phonemes from Table III was used in this
research. A representative phoneme set was chosen from word
groups spoken by the authors to develop an average prototype
phoneme set. This phoneme set was then correlated with
discrete and continuous sentence samples spoken by the authors.
This was done to verify that the phonemes could be located
and identified in speech from which the average prototype
phoneme set was calculated.

To  verify  that  the  phoneme  set  was  universal  and  could 
be  used  to  locate  and  identify  phonemes  in  the  speech  of 
others,  it  was  correlated  with  sentences  spoken  by  three 
different  speakers.  In  total,  twelve  sentences  composed 
of  words  containing  the  eight  prototype  phonemes  were 
analyzed. 

A hardware  modelling  study  was  also  done  to  implement 
this  speech  recognition  scheme.  A major  goal  of  this  study 
was  to  identify  either  the  application  of  current  micro- 
electronic devices  or  the  need  for  the  development  of  special 
devices  that  could  be  used  to  make  this  speech  recognition 
scheme  a reality. 


Fundamental  Phoneme  Set 
II. Data Acquisition


The  data  acquisition  scheme  used  to  locate  and  identify 
prototype  phonemes  is  based  on  the  restriction  that  the  speech 
data  must  be  similar  to  that  which  is  processed  by  the  human 
ear.  The  basic  function  of  the  outer  ear  is  to  transform 
the  acoustic  pressure  variations  of  sound  energy  so  that  it 
can  be  used  by  the  frequency  analysis  portion  of  the  middle 
ear and cerebral cortex to recognize speech (Ref 15). The
data  acquisition  and  processing  scheme  that  best  models  the 
function  of  the  human  ear  consists  of  the  following  elements: 
a speaker,  a microphone,  an  audio  tape  recorder,  an  analog- 
to-digital  computer,  a Fast-Fourier  Transform  (FFT)  computer 
algorithm,  and  a crosscorrelation/decision  computer  algorithm. 
An  overview  of  the  speech  recognition  process  showing  the 
parallels  between  human  and  machine  recognition  is  shown  in 
Figure  1. 

The  data  acquisition  process  consists  of  reciting  the 
desired  words,  phrases,  or  sentences  into  one  channel  of  a 
reel-to-reel  stereo  tape  recorder.  Tone  markers  of  2kHz 
are  recorded  on  the  second  channel  to  identify  the  beginning 
and  end  of  each  group  of  data,  as  well  as  the  change  of 
speakers.  Thus,  these  tones  identify  discrete  blocks  of 
speech  data  and  serve  as  a calibration  reference  point  for 
the  personnel  who  operate  the  preanalysis  FFT  computer 
algorithm.  The  scheme  used  to  record  the  speech  data  is 
shown  in  Figure  2. 


Figure  1.  An  Overview  of  Speech  Recognition  Showing  Parallels 
Between  Human  and  Machine  Recognition  (Ref  31) 


III.  Data  Preprocessing 


The  initial  processing  of  the  analog  speech  data  was 
accomplished  by  the  Analog/Hybrid  Systems  Branch  of  the  ASD 
Computer  Center.  Figure  3 illustrates  the  hardware  scheme 
used. 

Analog-to-Digital Conversion

The COMCOR CI-5000/6 analog-to-digital computer was
used to digitize the analog speech data. However, because
its input amplifiers were limited to a bandwidth of 2.5kHz,
it was necessary to modify the original speech data. Since
normal speech contains important frequencies up to 5kHz,
the bandwidth limitations of the computer's amplifiers were
compensated for in the following manner. The original speech
data was played at a speed of 3-3/4 inches-per-second, the
resulting audio signal was low-pass filtered to 2.5kHz, and
this signal was then sampled at twice this frequency (5kHz)
in order to satisfy the Nyquist sampling requirements. This
procedure is equivalent to playing the tape at its
originally recorded speed of 7-1/2 inches-per-second, low-pass
filtering to 5kHz, and sampling the final output at
10kHz. In addition, before the analog speech data was digitized,
the filtered signal was amplified to 100 volts to
insure a signal of sufficient amplitude to permit accurate
sampling by the 11-bit analog-to-digital converters of the
COMCOR computer.
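
The equivalence between half-speed playback and full-speed sampling can be checked with a little arithmetic. The Python sketch below is illustrative only: the function name `effective_rates` and its parameters are assumptions introduced here, not part of the original processing software.

```python
def effective_rates(record_ips, playback_ips, cutoff_hz, sample_hz):
    """Refer a filter cutoff and a sampling rate back to the original time scale.

    Playing a tape slower than it was recorded divides every frequency in
    the signal by the speed ratio, so rates applied to the slowed signal
    correspond to rates multiplied by that ratio on the original speech.
    """
    speed_ratio = record_ips / playback_ips
    return cutoff_hz * speed_ratio, sample_hz * speed_ratio


# Recorded at 7-1/2 ips, played at 3-3/4 ips, filtered to 2.5 kHz, sampled at 5 kHz.
cutoff, rate = effective_rates(7.5, 3.75, 2_500.0, 5_000.0)
print(cutoff, rate)
```

Under these figures the slowed-down procedure behaves like a 5 kHz low-pass filter followed by 10 kHz sampling of the original speech, which is the claim made in the text.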


Figure  3.  Data  Preprocessing  Scheme 




The  digital  output  from  the  COMCOR  computer  is  an  11-bit 
binary  representation  of  a four-digit  decimal  number  that 
describes  the  amplitude  of  the  analog  speech  data  at  a 
specific  instant  of  time.  This  format  of  the  original 
speech  data  was  then  used  as  the  input  signal  to  the  fre- 
quency analysis  segment  of  the  data  pre-processing  sequence. 

Frequency  Analysis 

In  order  to  identify  the  frequency  components  of  a 
particular  phoneme,  the  digitized  speech  data  was  converted 
to  an  equivalent  frequency  representation.  The  properties 
of  the  Fast-Fourier  Transform  (FFT)  that  permit  the  computa- 
tion of  a frequency  representation  of  a time-varying  signal 
were  used  to  accomplish  this  requirement  (Ref  2:41-52). 

The desired results were obtained using a Xerox Sigma-7
digital computer and the Analog/Hybrid Systems Branch AMPSPC
FFT algorithm. The AMPSPC algorithm reads the digitized
speech data of the COMCOR computer in sets of 128 samples
and creates a (1 x 128) input array for the FFT algorithm.
Since each time sample represents the analog output at
each $10^{-4}$ second (1/10 kHz) time increment, 128 of the
samples represent a net elapsed time of $12.8 \times 10^{-3}$ seconds
(12.8 ms). The AMPSPC algorithm then uses this input array
to compute the Discrete-Fourier Transform (DFT) of the
digitized time-varying speech signal. The result of this
computation is the magnitude of each complex number in the
frequency domain. Also, each point in the FFT array corresponds
to a frequency that is an integral multiple of 78.125 Hz
(10 kHz/128 samples).

Since  the  digitized  input  signal  to  the  FFT  is  composed 
of  real  numbers,  the  real  part  of  the  FFT  is  symmetric  about 
the  folding  frequency  (one-half  the  sampling  frequency) . 
Also,  the  magnitudes  of  the  FFT  elements  are  symmetric  about 
the folding frequency. Therefore, although 128 samples were
used to calculate the 128-point DFT, the conjugate symmetry
property of the FFT guarantees that only the first 64
transformed components are necessary to represent the frequency
spectrum  for  each  12.8  ms  time  interval  of  the  original 
analog  speech  signal.  Figure  4 illustrates  the  application 
of  the  FFT  technique  to  an  analog  speech  signal. 
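
The framing arithmetic and the reduction to 64 magnitudes can be illustrated with a short Python sketch. It is not the AMPSPC algorithm, whose listing is not reproduced here; the names `frame_magnitudes`, `FRAME_SECONDS`, and `BIN_HZ` are hypothetical. The sketch also assumes that the 64 retained components are numbers 1 through 64 (78.125 Hz through 5 kHz), skipping the zero-frequency term, which is inferred from the center frequencies quoted in the Channel Compression discussion rather than stated outright.

```python
import cmath
import math

SAMPLE_HZ = 10_000                      # effective sampling rate from the text
FRAME_LEN = 128                         # samples per transform
FRAME_SECONDS = FRAME_LEN / SAMPLE_HZ   # 0.0128 s, i.e. 12.8 ms
BIN_HZ = SAMPLE_HZ / FRAME_LEN          # 78.125 Hz between components


def frame_magnitudes(frame):
    """Magnitudes of DFT components 1..N/2 of one real-valued frame.

    A direct O(N^2) DFT is used for clarity; an FFT yields the same values.
    Component k lies at k * BIN_HZ. The zero-frequency term is skipped,
    which is an assumption (see the paragraph above).
    """
    n = len(frame)
    mags = []
    for k in range(1, n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


# A pure 625 Hz cosine (8 * 78.125 Hz) should peak in the eighth component.
tone = [math.cos(2 * math.pi * 8 * t / FRAME_LEN) for t in range(FRAME_LEN)]
mags = frame_magnitudes(tone)
peak = max(range(len(mags)), key=mags.__getitem__)
print(len(mags), (peak + 1) * BIN_HZ)
```

Because the input is real, components above the folding frequency mirror those below it, so nothing is lost by keeping only the lower half of the transform.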

Data  Storage 

The  medium  selected  to  store  the  FFT  speech  data  was  a 
magnetic  library  tape  (L-tape)  which  is  compatible  with  the 
input/output  options  of  the  Cyber/6600  computer.  Since 
this  L-tape  was  stored  at  the  ASD  Computer  Center,  access 
to  it  from  the  AFIT  processing  center  was  very  convenient. 
One  L-tape  per  speaker  was  created  to  avoid  the  confusion 
of  having  all  five  speakers  on  one  or  two  L-tapes.  This 
allowed  ready  access  to  a specific  individual's  words  and/or 
sentences.  The  transfer  of  the  FFT  speech  data  to  the 
L-tape  completes  the  data  preprocessing  sequence. 


IV.  Signal  Processing 


After  the  analog  speech  signal  was  digitized  and  written 
onto  L-tapes  as  described  in  section  three,  it  was  readily 
accessible  for  subsequent  processing.  This  data  was  in  the 
form  of  a digitized  output  from  64  discrete  audio  filters 
each  having  a center  frequency  of  some  integral  multiple  of 
78.125 Hz. Each number represents the averaged output of a
particular filter over an interval of 12.8 milliseconds.
Thus, each 12.8 millisecond sample of speech was represented
by a frequency vector having 64 components.

Channel  Compression 

Due  to  the  fact  that  the  ear-brain  system  responds  to 
ratios  of  frequencies  rather  than  absolute  frequency  values, 
the  original  64-component  vectors  were  compressed  to  approxi- 
mate the  frequency  response  of  the  ear  (Ref  16:85). 

The  compression  was  implemented  in  the  following  manner. 
The  first  six  vector  components  with  center  frequencies  from 
78.125  Hz  to  468.750  Hz  were  left  unchanged.  The  remaining 
58  vector  components  were  separated  into  1/3  octave  groups. 
The  magnitudes  of  the  components  in  each  group  were  added 
together,  thus  weighting  the  values  at  the  high  end  of  the 
frequency  scale.  This  resulted  in  a 16-component  frequency 
vector  with  the  center  frequencies  shown  in  Table  IV. 
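
A minimal Python sketch of this compression step follows. The exact band edges belong to Table IV and are not reproduced here, so the sketch makes two assumptions that are labeled in the code: the first 1/3-octave group starts halfway between components 6 and 7, and there are ten groups, each spanning a frequency ratio of 2^(1/3). The name `compress_frame` is hypothetical.

```python
import math

BIN_HZ = 10_000 / 128                  # 78.125 Hz spacing of the 64 components
N_LINEAR = 6                           # components 1-6 are copied unchanged
N_BANDS = 10                           # assumed count of 1/3-octave groups (6 + 10 = 16)
START_HZ = (N_LINEAR + 0.5) * BIN_HZ   # assumed lower edge of the first group


def compress_frame(frame):
    """Reduce a 64-component magnitude vector to 16 channels."""
    if len(frame) != 64:
        raise ValueError("expected 64 components")
    out = list(frame[:N_LINEAR]) + [0.0] * N_BANDS
    for i in range(N_LINEAR, len(frame)):
        center_hz = (i + 1) * BIN_HZ
        # Each 1/3 octave spans a frequency ratio of 2 ** (1/3).
        band = int(math.floor(3 * math.log2(center_hz / START_HZ)))
        band = min(band, N_BANDS - 1)
        # Magnitudes within a group are added, as described in the text.
        out[N_LINEAR + band] += frame[i]
    return out


flat = compress_frame([1.0] * 64)
print([int(v) for v in flat])
```

With a flat input spectrum, the higher channels come out larger simply because each one sums more components. This is the high-frequency weighting the text describes.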

Another  effect  of  this  channel  reduction  is  somewhat 
analogous  to  a phono  equalization  or  preemphasis  curve.  The 
power  density  of  the  high  frequency  information  is  boosted 
to approximately match the power density of the low frequency
information. 

Spectrogram  Development 

After  the  speech  data  was  compressed,  the  frequency 
vectors  were  processed  by  the  analysis  portions  of  the  recog- 
nition scheme.  However,  it  is  helpful  to  be  able  to  look  at 
the  speech  data  in  a format  which  allows  a visual  analysis. 
Much  work  has  been  done  in  visual  speech  analysis  by  Potter, 
Kopp  and  Green  (Ref  20).  They  found  that  there  were  suf- 
ficient visual  clues  in  a time-frequency  spectrogram  to  allow 
trained  personnel  to  do  a remarkably  accurate  job  of  inter- 
preting the  original  speech. 

To  transform  the  16-component  speech  vectors  into  a form 
which  would  resemble  a speech  spectrogram,  a two-dimensional 
printing  scheme  was  used.  The  printing  scheme  adopted  plots 
the  numerical  magnitudes  of  each  component  of  the  frequency 
vector  on  one  axis  and  the  time  of  the  occurrence  on  the 
other  axis.  An  overprint  arrangement,  which  causes  the 
representation  of  the  frequency  component  to  become  in- 
creasingly dark  as  the  component's  magnitude  increases,  was 
used  to  produce  the  plots.  Figure  5 shows  an  example  of  a 
speech  sample  along  with  its  representative  spectrogram. 

The  speech  spectrograms  obtained  by  this  process  closely 
mimic the frequency-time spectrograms used by Potter, Kopp
and  Green.  A complete  description  of  the  spectrogram 
overprint  scheme  is  given  in  Appendix  E. 
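
The idea of a printed spectrogram can be sketched in Python. The original scheme overprints line-printer characters and is specified in Appendix E. The version below substitutes a simple single-character darkness ramp, so the ramp `RAMP`, the function `render_spectrogram`, and the `full_scale` parameter are all assumptions made for illustration. Time is assumed to run horizontally, with one column per frequency vector.

```python
RAMP = " .:+*#"   # assumed darkness ramp, lightest to darkest


def render_spectrogram(frames, full_scale):
    """Render frames (one list of channel magnitudes per time step) as text.

    Each output row is one frequency channel, highest channel first.
    Each output column is one time step.
    """
    lines = []
    for channel in reversed(range(len(frames[0]))):
        row = ""
        for frame in frames:
            level = int(frame[channel] / full_scale * (len(RAMP) - 1))
            row += RAMP[max(0, min(level, len(RAMP) - 1))]
        lines.append(row)
    return "\n".join(lines)


# Two channels, three time steps: energy moves from the low to the high channel.
print(render_spectrogram([[9.0, 0.0], [5.0, 5.0], [0.0, 9.0]], full_scale=10.0))
```

Darker characters mark larger magnitudes, so formant-like bands appear as dark horizontal streaks, much as in a conventional spectrogram.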
The computer program, which performs the frequency
reduction, preemphasis, and spectrogram output, is listed in
Appendix B as OCTAVE1. In addition, OCTAVE1 stores the
reduced data onto a computer L-tape to be accessed by later
stages of the recognition process.

Due  to  the  very  nature  of  speech,  an  utterance  can  con- 
tain phonemes  with  large  amounts  of  energy  next  to  phonemes 
containing  less  energy.  These  lower  energy  phonemes  may  not 
have  enough  energy  to  show  up  in  the  spectrogram  previously 
described,  and  be  missed  even  though  they  contain  valuable 
information.  To  reconcile  this  problem,  a normalization 
procedure  was  performed  on  each  frequency  vector.  This  is 
analogous  to  a conventional  automatic  gain  control  circuit. 
Previous work done by Neyman (Ref 19), Hensley (Ref 13), and
Guyote and Sisson (Ref 11) emphasized the importance of data
normalization. 

To accomplish the normalization procedure, the 16-component
frequency vectors produced by OCTAVE1 were manipulated
vector by vector. Each frequency vector was normalized
as follows. The magnitude of the vector was computed by:

$$G = \left( \sum_{i=1}^{16} s_i^2 \right)^{1/2}$$

where the $s_i$'s are the frequency vector components. The
column was then normalized by replacing each vector component
$s_i$ with $s_i^*$, where $s_i^* = s_i / G$. This insured that the energy
of each frequency vector was equal to one. To further
emphasize the low energy phonemes, the new frequency components
($s_i^*$) were multiplied by 10 to ensure they would be
depicted in the output of the normalized spectrogram.

To eliminate the effect of noise being amplified and
overprinted in the spectrogram, the value of G was tested.
If the value of G was less than a number calculated as the
average magnitude of the noise level, the vector was not
normalized and was assigned a magnitude of 1.0. Frequency
vectors with magnitudes of 1.0 were too small to be represented
by a character in the overprint scheme.
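
The normalization and noise test can be sketched in Python as follows. This is not the OCTAVE2 listing. The function name `normalize_frame` and the parameters `noise_floor` and `boost` are hypothetical. The handling of sub-threshold vectors is one assumed reading of the text: the vector is scaled to magnitude 1.0 rather than 10, so that it falls below the printable range.

```python
import math


def normalize_frame(frame, noise_floor, boost=10.0):
    """Scale one frequency vector to a fixed magnitude.

    G is the Euclidean magnitude of the vector. Vectors at or above the
    noise floor are scaled to magnitude `boost` (unit energy times 10).
    Quieter vectors are scaled to magnitude 1.0 (assumed interpretation).
    """
    g = math.sqrt(sum(s * s for s in frame))
    if g == 0.0:
        return list(frame)          # an all-zero vector cannot be scaled
    scale = boost if g >= noise_floor else 1.0
    return [scale * s / g for s in frame]


loud = normalize_frame([3.0, 4.0], noise_floor=1.0)
quiet = normalize_frame([0.3, 0.4], noise_floor=1.0)
print(loud, math.hypot(*quiet))
```

As with an automatic gain control, every vector above the noise threshold ends up with the same overall magnitude, so weak phonemes print as darkly as strong ones.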

A comparison of the spectrogram produced by OCTAVE1 and
its  column  normalized  version  is  shown  in  Figure  6.  As  can 
be  seen,  the  normalized  spectrogram  provides  a more  complete 
representation  of  the  speech  data.  For  example,  the 
normalized  spectrogram  representation  of  the  word  "debt"  in 
Figure  6 clearly  shows  the  ending  "t". 

The program which implements this normalization
procedure is called OCTAVE2 and its listing appears in
Appendix B. This program produces a normalized spectrogram
for use in the visual phoneme selection analysis, whereas
the L-tapes produced by OCTAVE1 were used by the correlation
program.

Data  Base 

Due to the fact that the preprocessing phase is quite
lengthy and, at the Wright-Patterson computer facility, can
take up to two weeks to obtain results, the investigation
was done in two segments. First, a set of averaged prototype
phonemes was selected from various word groups spoken
by the authors. Then each of the averaged prototype phonemes
was correlated with several sentence samples spoken by three
different speakers. These sentence samples were composed to
insure that the phonemes of interest were represented.

Figure 6. Normalized vs Non-normalized Spectrograms

The first set of data is presented in Tables V and VI.
Table V shows the various phoneme sounds that were selected
for analysis in this research. It also lists the groups of
14 words containing the desired phoneme. From these words
the specific phoneme was selected and averaged to give the
prototype phoneme.

The sentences listed in Table VI were used for verifying
the averaged prototype phonemes selected from the word groups.
The first two sentences are composed of words from Table V.
The last two sentences contain phoneme sounds like those
listed in Table V. This test group of sentences was used
to verify that the averaged prototype phonemes selected from
the authors' word groups would identify similar phonemes
appearing in their continuous and discrete speech.

To aid in the selection of the averaged prototype
phonemes, each author recorded the words in Table V and
sentences in Table VI according to the following format. The
words in each word group were spoken discretely and a
five-second 2kHz tone was recorded to mark the beginning and end
of each word group. The speaker then recited each of the
sentences by first saying it discretely and then continuously.
Table V

Phoneme Word Groups

"B" Sound    "D" Sound    "R" Sound    "T" Sound

bay          debt         rat          taker
babble       debit        read         terminate
batter       ditto        ride         tide
be           donut        robe         tight
bench        dug          rut          toad
bitter       dust         rhino        tore
bite         drafted      rather       tub
boat         danger       rear         tube
bought       dagger       right        through
by           dread        resist       tither
butter       dead         rand         tribe
blend        dodge        rover        tip
bright       dude         rare         twist
bulb         day          rubber       trade

"A" Sound     "AUH" Sound   "E" Sound     "O" Sound

hate          among         leave         go
Abraham       about         each          so
hay           American      me            blow
range         topeka        see           obey
same          santa         even          omit
terminate     mascera       leech         over
wave          another       beat          note
shape         Caruso        meet          those
trace         appear        sleep         pose
angel         attempt       valley        rose
may           accumulate    reek          nose
ray           associate     key           most
say           approximate   egress        both
lay           against       ego           no

Table  VI 

Verification  Sentence  Groups 


Abraham  drafted  a note. 

See  me  wave  at  my  associate. 


A boy  got  out  the  back  gate. 

Joe was seen around the airplane.


Similarly,  each  sentence  was  separated  from  the  other  with  a 
five-second  2kHz  tone. 

The  other  set  of  data  consisted  of  the  two  groups  of 
sentences  listed  in  Table  VII.  The  first  six  sentences  con- 
tain words  included  in  Table  V.  The  last  six  sentences 
consist  of  words  not  appearing  in  Table  V but  having  similar 
sounding  phonemes.  This  data  set  was  spoken  by  three  dif- 
ferent speakers.  The  purpose  of  this  data  was  to  investigate 
how  well  the  averaged  prototype  phonemes  could  identify 
similar  phonemes  in  speech  from  different  speakers. 

To  keep  the  data  between  speakers  separate,  each  would 
say  a sentence  discretely  and  then  continuously  until  he  had 
completed  the  sentences  in  Table  VII.  The  five-second  2kHz 
tone was again used to mark the beginning and end of each
sentence.

All  the  speakers  involved  were  male  and  from  different 
parts  of  the  country  so  some  dialect  influences  were  present 
in  their  speech.  Each  data  set  was  recorded  and  processed 
as  described  previously  and  stored  in  the  16  channel  reduced 
form. 

Phoneme  Selection 

Since the production of a complete set of phonemes was
not a goal of this investigation, only the phonemes listed
in Table V were pursued. The selection of the phoneme's
sound was facilitated by the use of the normalized spectrogram
and the pictorial representations of the phonemes from
Potter, Kopp and Green (Ref 20). The normalized spectrogram
of each word group was compared with the pictorial representation
of a particular phoneme during this process. Once
found, the location of a phoneme was recorded by noting the
time values printed on the spectrogram. These time values
were used during the phoneme extraction process. This procedure
was implemented for analyzing the spectrograms of the
authors' speech.


Table VII

Test Sentence Groups

Abraham drafted a note.
See me wave at my associate.
The batter dug into the dust and made a rut the shape
   of his foot.
No note to terminate the leave of the American called
   Caruso was drafted this day.
The bright bulb formed a ray that made a trace of the
   rubber rat.
From the boat docked in the bay, we saw the rhino, leech,
   and toad as they lay dead along the tide.

Before the trip, the rabbit rested along the open field
   of the rancher.
A boy got out the back gate.
Does Dennis teach reading or does Dennis teach
   driving?
Joe was seen around the airplane.
Take a closer look at Eastman Kodak's bubbling reagents
   for photo-resist stripping.
Each person at Beckman sees his responsibility aimed
   toward fabricating better resistors, displays, and
   drugs.

The  lengths  of  the  phonemes  were  selected  to  minimize 
the  transitions  between  phonemes.  However,  each  specific 
phoneme  was  selected  to  be  as  long  as  possible  within  the 
above  constraint.  Also,  the  phonemes  selected  from  each 
respective  group  of  words  were  chosen  to  be  of  the  same 
length  so  that  they  could  be  averaged  together. 

Phoneme  Extraction  and  Averaging 

During  the  analysis  of  the  word  groups,  the  locations 
of  the  target  phonemes  were  recorded.  This  produced  a time 
of  occurrence  listing  for  each  of  the  14  target  phonemes  in 
a particular  word  group.  This  list  was  then  incorporated 
into  a program  called  PUNCH,  which  produced  a set  of  punched 
cards  corresponding  to  the  time  of  occurrence  of  the  14 
target phonemes in each group of words. A listing of
program PUNCH appears in Appendix B.

For  each  of  the  authors,  this  process  resulted  in  a set 
of  punched  cards  consisting  of  14  target  phonemes  for  each 
of  the  eight  word  groups.  Thus,  by  combining  the  results 
for both authors, a set of punched cards for 28 target phonemes
was  produced  for  each  of  the  desired  phonemes. 

Since  these  28  target  phonemes  were  all  of  the  same 
length,  they  could  be  averaged  together.  The  program  that 
performs  this  averaging  process  is  called  PROAVE  and  its 
listing  appears  in  Appendix  B. 

For  each  frequency  vector  in  a particular  phoneme,  the 
program  sums  up  the  28  target  phoneme  components  and  divides 
by  28  to  give  an  averaged  value  for  the  component.  This 
process  is  illustrated  in  Figure  7.  This  results  in  an 
averaged  prototype  phoneme  which  will  now  be  referred  to  as 
a prototype  phoneme.  For  each  of  the  desired  phonemes,  this 
averaging  process  was  performed  and  resulted  in  a set  of  eight 
prototype  phonemes. 
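
The averaging step can be sketched in Python as follows. This is a minimal illustration and not the PROAVE listing from Appendix B: the function name `average_prototype` and the list-of-lists layout (N time frames by 16 channels) are assumptions made for the example, and the demonstration data are hypothetical.

```python
from typing import List

Frame = List[float]      # one frequency vector (16 channels in the text)
Phoneme = List[Frame]    # N frames; every sample must share the same N


def average_prototype(samples: List[Phoneme]) -> Phoneme:
    """Average equal-length phoneme samples component by component."""
    if not samples:
        raise ValueError("at least one sample is required")
    n_frames = len(samples[0])
    n_channels = len(samples[0][0]) if samples[0] else 0
    for sample in samples:
        if len(sample) != n_frames or any(len(f) != n_channels for f in sample):
            raise ValueError("all samples must have the same shape")
    count = len(samples)  # 28 in the text
    return [
        [sum(sample[t][c] for sample in samples) / count
         for c in range(n_channels)]
        for t in range(n_frames)
    ]


# Two tiny hypothetical samples (2 frames x 3 channels) stand in for
# the 28 samples of N x 16 components described in the text.
demo = average_prototype([
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
])
```

The shape check mirrors the requirement stated above that all target phonemes be of the same length before they are averaged.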

Phoneme  Analysis 

Since  these  eight  prototype  phonemes  were  formed  by 
averaging  like  phonemes  from  two  speakers  to  yield  an  overall 
representation of the desired phoneme, they should be able to
identify  similar  phonemes  spoken  by  the  same  speakers.  To 
facilitate  the  selection  of  a set  of  optimum  phonemes  to  do 
this  process,  the  averaged  prototype  phonemes  were  varied  in 
length  and  correlated  with  the  groups  of  words  they  were 
selected  from.  The  correlation  process  is  discussed  in  the 
next  section. 

Each  prototype  phoneme  was  varied  in  length  to  yield 
nine  samples  of  the  prototype.  To  help  with  this  process, 
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OCTAVE 2 was  modified  to  accept  punched  cards  and  give  a 
normalized  spectrogram.  From  the  results  of  the  correlations 
of  these  nine  samples  of  the  prototypes  with  the  words  they 
came  from,  a subset  of  the  three  best  prototype  variations 
was  selected. 

These  three  variations  were  then  correlated  with  the 
sentences  in  Table  VI.  This  resulted  in  selecting  an  optimum 
prototype phoneme that yielded the best results in identifying
the desired phoneme in both continuous and discrete
speech.  The  final  result  of  these  correlations  was  a set 
of  eight  optimum  prototype  phonemes. 

These  eight  prototype  phonemes  were  then  correlated 
with  the  sentences  in  Table  VII.  This  stage  of  processing 
was  done  to  determine  whether  the  averaged  prototype  phoneme 
set  was  characteristic  of  similar  phoneme  sounds  in  the 
speech of others. The results of this analysis are discussed
in a later section.


V.  Recognition  Processing 


The  recognition  processing  phase  consists  of  performing 
a running  crosscorrelation  of  each  averaged  prototype  phoneme 
with  the  sentence  samples.  Each  averaged  prototype  phoneme 
consists  of  an  N x 16  array  of  frequency  vector  components, 
where  N is  the  length  of  a particular  phoneme.  Similarly, 
each  sentence  sample  consists  of  an  M x 16  array  of  frequency 
vector  components,  where  M is  the  length  of  a particular 
sentence.  For  this  research,  it  was  found  that  establishing 
an  upper  limit  on  M equal  to  700  was  adequate  for  all  the 
sentences  analyzed.  This  equates  to  an  utterance  8.96  seconds 
long.  The  output  of  this  correlation  calculation  is  an  M x R 
array  where  M,  as  defined  above,  is  the  length  of  the  sentence 
sample and R is the number of phonemes contained in the
prototype phoneme set. The value of each element in the M x R
array is the result of the correlation of a particular
phoneme with the sentence sample at a particular instant of time.

In  order  to  implement  the  correlation  computation,  it  was 
necessary  to  prepare  both  the  phonemes  and  sentence  samples 
with the following operations: column normalization, array
augmentation,  and  phoneme  unit  normalization.  The  program 
used  to  perform  the  correlation  computations  was  called 
CRSCOR  and  its  listing  appears  in  Appendix  B. 

Column  Normalization 

An extremely important aspect of the recognition processing
phase is the normalization of the data (Ref 11, 13, 19).
The purpose of normalization is to minimize the effect of
speaker variation and provide a basis upon which a decision
scheme could be devised. The prototype phonemes and the
sentence samples were column normalized at each time increment.
Each component of the 16-channel frequency vector was
normalized according to the formula:

$$ \hat{x}_j = \frac{x_j}{\left[ \sum_{i=1}^{16} (x_i)^2 \right]^{1/2}} $$

where x_j is the j-th component of the frequency vector and
\hat{x}_j is its normalized value.

From the spectrograms of the sentences, it was observed
that the magnitude of the non-information bearing frequency
vectors between the words was limited to an approximate value
of 0.5. To prevent this information from entering the
correlation calculation, the magnitude of each frequency vector
of a particular sentence sample was tested by the following
inequality prior to the column normalization calculation:

$$ \left[ \sum_{i=1}^{16} (x_i)^2 \right]^{1/2} < 0.5 $$

If the inequality was satisfied, the frequency vector was
not column normalized, but instead was assigned a magnitude
of 0.001 to insure that the correlation values for these
components were very small numbers.
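
A minimal Python sketch of the column normalization and the low-magnitude test is given below. It is an illustration under stated assumptions, not the CRSCOR code: the names `column_normalize`, `SILENCE_THRESHOLD`, and `SILENCE_MAGNITUDE` are hypothetical, and "assigned a magnitude of 0.001" is interpreted here as rescaling the whole vector to that magnitude, which the text does not spell out.

```python
import math
from typing import List

SILENCE_THRESHOLD = 0.5    # magnitude limit quoted in the text
SILENCE_MAGNITUDE = 0.001  # magnitude given to frames that fail the test


def column_normalize(frame: List[float], gate: bool = True) -> List[float]:
    """Scale one frequency vector to unit magnitude.

    With gate=True (sentence samples), a frame whose magnitude is below
    SILENCE_THRESHOLD is rescaled to SILENCE_MAGNITUDE instead.
    """
    magnitude = math.sqrt(sum(x * x for x in frame))
    if gate and magnitude < SILENCE_THRESHOLD:
        if magnitude == 0.0:
            # Assumption: an all-zero frame becomes a flat vector of
            # the small magnitude, since it has no direction to keep.
            flat = SILENCE_MAGNITUDE / math.sqrt(len(frame))
            return [flat] * len(frame)
        return [x * SILENCE_MAGNITUDE / magnitude for x in frame]
    if magnitude == 0.0:
        raise ValueError("cannot normalize an all-zero frame")
    return [x / magnitude for x in frame]


# Hypothetical 16-channel frames: one with speech energy, one quiet.
speech = column_normalize([3.0, 4.0] + [0.0] * 14)
quiet = column_normalize([0.3] + [0.0] * 15)
```

Passing `gate=False` corresponds to the prototype phonemes, for which the text describes only the plain normalization.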

The column normalization calculation was the only
normalization performed on the sentence samples. In addition
to being column normalized, the phoneme arrays were unit
normalized after their Discrete Fourier Transform (DFT)
calculation.


Array  Augmentation 


The  motivation  of  this  research  was  directed  toward  the 
correlation  of  a two-dimensional  averaged  prototype  phoneme 
with  its  "variations"  that  occur  in  everyday  speech.  Since 
real-time  correlation  calculations  require  enormous  amounts 
of  computations,  even  large-scale  computers,  such  as  the  CDC 
Cyber  6600,  would  require  excessive  amounts  of  time  to  do  the 
calculations.  Therefore,  the  computer  algorithm  used  in  this 
research  was  based  on  the  DFT. 

Innovations of the past ten years in computing the DFT of
matrices, such as the Fast Fourier Transform (FFT), have made
it possible to greatly reduce the amount of computation
needed for correlation (Ref 1, 2, 12). However,
the  use  of  DFT  theory  requires  that  certain  inherent  problems 
be  considered.  The  most  critical  problems  are  aliasing, 
leakage,  and  end-effect. 

Aliasing  is  a term  that  refers  to  the  fact  that  high- 
frequency  components  of  a time  function  can  impersonate  low 
frequencies if the sampling rate is too low (Ref 2). This
problem was avoided during the digitization process by using
a 10 kHz sampling rate, which was twice the highest speech
frequency (5 kHz).

The  problem  of  leakage  is  inherent  in  the  Fourier  analysis 
of  any  finite  record  of  data.  Furthermore,  leakage  is  directly 
related  to  the  method  by  which  the  digitized  samples  of  an 
analog  signal  are  selected  or  windowed.  The  ideal  window 
function is one that would localize the contribution of a
given frequency in a narrow main lobe while reducing the
amount of "leakage" through the side lobes. It is well known
that both of these criteria cannot be optimized simultaneously
and that the selection of a window function is a compromise
between leakage and the width of the main lobe. Neyman
tested the Hanning and rectangular window functions and
reported that the overall recognition results were not altered
when either window function was used (Ref 19). The rectangular
window function was used in this research since it was easiest
to implement.

The problem referred to as end-effect occurs when two
functions are correlated because of the periodicity imposed
by the DFT. The correlation computations done in this
research required that a buffer be included in the transformed
functions so that the function which is being moved along the
time axis does not encounter duplicates of the data being
correlated. This problem was solved by augmenting the arrays
in the following manner.

Let p_{ij} be the prototype phoneme array and s_{ij} be the
sentence array. Let P be the number of points defining the
length of p_{ij} and S the number of points defining the length
of s_{ij}. Choose a V such that:

$$ V \geq P + S - 1 $$

and

$$ V = 2^n $$

where n is an integer.

The augmented arrays s_{kb} and p_{kb} for s_{ij} and p_{ij},
respectively, are defined as follows:

$$ s_{kb} = \begin{cases}
0, & k = 0, 1, 2, \ldots, V-S; \quad b = 0, 1, 2, \ldots, 15 \\
s_{ij}, & k = V-S+1, V-S+2, \ldots, V-1; \quad b = j = 0, 1, 2, \ldots, 15; \quad i = 0, 1, 2, \ldots, S-1 \\
0, & k = 0, 1, 2, \ldots, V-1; \quad b = 16, 17, \ldots, 31
\end{cases} $$

$$ p_{kb} = \begin{cases}
p_{ij}, & k = i = 0, 1, 2, \ldots, P-1; \quad b = j = 0, 1, 2, \ldots, 15 \\
0, & k = P, P+1, \ldots, V-1; \quad b = 0, 1, 2, \ldots, 15 \\
0, & k = 0, 1, 2, \ldots, V-1; \quad b = 16, 17, \ldots, 31
\end{cases} $$


The  array  transformation  is  illustrated  in  Figure  8. 

The  augmented  arrays  serve  to  embed  the  prototype 
phoneme  and  sentence  arrays  in  a sufficient  buffer  of  zeroes 
to  eliminate  the  end-effect  problem.  Furthermore,  the 
augmented  arrays,  which  are  both  32  x 64  arrays,  can  be 
correlated  to  yield  a 32  x 64  array. 
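
The augmentation can be sketched in Python as shown below. This is an illustration under assumptions rather than the original code: the helper names `next_power_of_two` and `augment` are hypothetical, arrays are stored as V rows (time) of 32 values (channels), and the sentence frames are simply placed in the last S rows behind a block of zero rows.

```python
from typing import List, Tuple

Array = List[List[float]]


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is greater than or equal to n."""
    v = 1
    while v < n:
        v *= 2
    return v


def augment(prototype: Array, sentence: Array,
            width: int = 32) -> Tuple[int, Array, Array]:
    """Zero-pad both arrays to V rows by `width` columns.

    The prototype occupies the first P rows; the sentence occupies the
    last S rows (an assumption about the exact row placement).
    """
    p_len, s_len = len(prototype), len(sentence)
    v = next_power_of_two(p_len + s_len - 1)   # V >= P + S - 1, V = 2**n

    def pad(frame: List[float]) -> List[float]:
        return list(frame) + [0.0] * (width - len(frame))

    aug_proto = [pad(f) for f in prototype]
    aug_proto += [[0.0] * width for _ in range(v - p_len)]
    aug_sent = [[0.0] * width for _ in range(v - s_len)]
    aug_sent += [pad(f) for f in sentence]
    return v, aug_proto, aug_sent


# Hypothetical data with the limiting sizes quoted in the text.
v, aug_p, aug_s = augment([[1.0] * 16 for _ in range(15)],
                          [[2.0] * 16 for _ in range(48)])
```

With P = 15 and S = 48 the sketch selects V = 64, so both augmented arrays hold 64 rows of 32 values.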

The limiting values of V and S were determined by the
AFIT FFT subroutine (FOURT) (Ref 12). For this research, V
was limited to 64 and S to 48. Since the size of the
sentence samples was fixed, it was necessary to limit the size
of the prototype phonemes to effect a complete correlation.
This situation was reconciled by incorporating an overlap
variable, T, into the structure of the augmented sentence
array. As each sequential sentence array was augmented,
only (S-T) new values were included in the array. The last
T values of the previous sentence sample became the first T
values of the new sentence sample. For this research, T was
fixed at 8 and this meant that the length of the largest
prototype phoneme was limited to 15 to insure that the
overlap was greater than half the length of the largest
prototype phoneme.
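
The overlap bookkeeping can be sketched as follows. The function name `overlapping_blocks` is hypothetical, and the treatment of a short final block (returned as is, without padding) is an assumption, since the text does not say how the end of a sentence was handled.

```python
from typing import List, Sequence, Tuple


def overlapping_blocks(frames: Sequence, block_len: int = 48,
                       overlap: int = 8) -> List[Tuple[int, list]]:
    """Split a sentence into blocks of S frames that share T frames.

    Each block after the first repeats the last `overlap` frames of its
    predecessor, so only (S - T) new frames enter per block.
    """
    if not 0 <= overlap < block_len:
        raise ValueError("overlap must satisfy 0 <= T < S")
    blocks = []
    start = 0
    step = block_len - overlap
    while start < len(frames):
        blocks.append((start, list(frames[start:start + block_len])))
        if start + block_len >= len(frames):
            break  # this block already reaches the end of the sentence
        start += step
    return blocks


# Hypothetical sentence of 100 frames, labelled by frame index.
blocks = overlapping_blocks(list(range(100)))
starts = [start for start, _ in blocks]
```

With S = 48 and T = 8, each block starts 40 frames after the previous one.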

Fast  Fourier  Transform 

Following the augmentation of both the prototype phoneme
and sentence arrays, their two-dimensional DFT was computed
using the FFT algorithm, FOURT (Ref 12). The transformed
arrays were calculated as follows:

$$ P_{rs} = \sum_{k=1}^{K} \sum_{b=1}^{L} p_{kb} \exp\left[ (-2 j \pi) \left( \frac{kr}{K} + \frac{bs}{L} \right) \right] $$

$$ S_{rs} = \sum_{k=1}^{K} \sum_{b=1}^{L} s_{kb} \exp\left[ (-2 j \pi) \left( \frac{kr}{K} + \frac{bs}{L} \right) \right] $$

where j = \sqrt{-1}.

The complex conjugate of the transformed prototype
phoneme array was then formed as P*_{rs}. The phoneme array
was then ready for the unit normalization process.

Unit  Normalization 

The prototype phoneme array was column normalized before
the DFT computation. The column normalization calculation,
as discussed earlier, insured that the energy of each
frequency vector was unity. However, the net energy of each
prototype phoneme remained a direct function of its length.
As a result, the correlation value for a perfect match
between a prototype phoneme and a candidate phoneme in a given
sentence sample could be compromised by a long prototype
phoneme. Unit normalization was done to insure that each
prototype, no matter what its length, had unit energy prior
to the correlation computation.

Unit normalization was computed as follows:

$$ P^*_{Nrs} = P^*_{rs} / (\mathrm{Energy})^{1/2} $$

where

$$ \mathrm{Energy} = \sum_{r=1}^{32} \sum_{s=1}^{64} (P^*_{rs})^2 $$

It is noted here that the energy computed above is the
energy of the prototype phoneme after column normalization.
Since there are N columns of unit energy, the net energy of
a given column normalized prototype phoneme is N. Thus, to
unit normalize a prototype phoneme, each element of the
column normalized array is divided by N^{1/2}. The significance
of this particular calculation with respect to correlation
will be discussed in the next section.
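
The energy argument can be checked with a short Python sketch. It works in the time domain on a small hypothetical prototype (N = 4 frames of 3 channels rather than 16), and the helper names are assumptions made for the illustration.

```python
import math
from typing import List

Array = List[List[float]]


def energy(array: Array) -> float:
    """Sum of the squared components of the whole array."""
    return sum(x * x for frame in array for x in frame)


def column_normalize(array: Array) -> Array:
    """Scale every frame (column) to unit magnitude."""
    out = []
    for frame in array:
        magnitude = math.sqrt(sum(x * x for x in frame))
        out.append([x / magnitude for x in frame])
    return out


def unit_normalize(array: Array) -> Array:
    """Scale the whole array so that its net energy is one."""
    scale = math.sqrt(energy(array))
    return [[x / scale for x in frame] for frame in array]


# Hypothetical prototype with N = 4 nonzero frames.
prototype = [[1.0, 2.0, 2.0], [0.0, 3.0, 4.0],
             [2.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
columns = column_normalize(prototype)   # net energy equals N
unit = unit_normalize(columns)          # net energy equals 1
```

After column normalization the net energy equals the number of frames, so the unit normalization here reduces to dividing every element by the square root of N, as the text states.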


Correlation  Computations 

After the unit normalization of the prototype phoneme
array, the element-by-element product was computed as:

$$ Z_{rs} = S_{rs} \cdot P^*_{Nrs} $$

The result of this multiplication is equivalent to
correlation in the time domain. The desired correlation
values were obtained by computing the inverse transform of
Z_{rs}. The inverse transform was computed as follows:

$$ z_{kb} = \frac{1}{KL} \sum_{r=1}^{K} \sum_{s=1}^{L} Z_{rs} \exp\left[ 2 j \pi \left( \frac{kr}{K} + \frac{bs}{L} \right) \right] $$

Following the inverse transform computations, the
correlation vector for a particular phoneme was formed by
taking the first, or zero shift, row from the z_{kb} array.
This row was transferred to the correlation array as follows:

$$ C_i = z_{kb} $$

where

     k = S, S + 1, ..., V-T
     b = 1
     i = 1, 2, ..., R  (the particular phoneme correlated
                        with the sentence)

The first (S - 1) values were discarded to compensate for
the end-effect. The last T values account for the overlap
factor.
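
The equivalence between the frequency-domain product and correlation in the time domain can be illustrated with a small Python sketch. It is deliberately simplified: a single channel stands in for the two-dimensional arrays, a direct O(V^2) DFT replaces the FOURT subroutine, and the function names and data are hypothetical.

```python
import cmath
from typing import List, Sequence


def dft(x: Sequence[complex], inverse: bool = False) -> List[complex]:
    """Direct one-dimensional DFT (or inverse DFT with the 1/V factor)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for r in range(n):
        total = sum(x[k] * cmath.exp(sign * 2j * cmath.pi * k * r / n)
                    for k in range(n))
        out.append(total / n if inverse else total)
    return out


def correlate_via_dft(sentence: Sequence[float],
                      prototype: Sequence[float]) -> List[float]:
    """Inverse DFT of S times the complex conjugate of P."""
    s_hat, p_hat = dft(sentence), dft(prototype)
    product = [s * p.conjugate() for s, p in zip(s_hat, p_hat)]
    return [z.real for z in dft(product, inverse=True)]


def correlate_direct(sentence: Sequence[float],
                     prototype: Sequence[float]) -> List[float]:
    """Circular cross-correlation computed as a plain sum of products."""
    n = len(sentence)
    return [sum(sentence[(m + k) % n] * prototype[m] for m in range(n))
            for k in range(n)]


# Hypothetical one-channel data, already zero padded to V = 8.
prototype = [1.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
sentence = [0.0, 0.0, 0.0, 0.5, 1.0, 2.0, 0.0, 0.0]
via_dft = correlate_via_dft(sentence, prototype)
direct = correlate_direct(sentence, prototype)
```

Both routes give the same values, with the largest at the shift where the prototype pattern sits in the sentence. Because the DFT treats the data as periodic, the zero padding is what keeps wrapped-around samples out of the result.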

Before the correlation array could be used in a decision
scheme, a basis for comparing the correlation values over all
time for all prototype phonemes had to be developed.
Obviously a larger prototype phoneme will have a greater
maximum correlation value when it encounters a large
candidate phoneme than will a short prototype phoneme. It
would be highly desirable to normalize the maximum correlation
values to unity so that the performance of all prototype
phonemes could be compared. Since the prototype phonemes
were column and unit normalized and the sentence samples
were column normalized, the maximum correlation obtainable
by a prototype phoneme which encounters an exact replica of
itself would be N^{1/2}, where N is the length of the prototype
phoneme. A mathematical derivation of this fact is
presented in Appendix D.

It was possible to ensure that the maximum correlation
value for any prototype phoneme was unity by simply dividing
the correlation values for each prototype phoneme by the
square root of the length of the prototype phoneme (N^{1/2}).

Since the computed correlation values for any prototype
phoneme will be restricted between zero and unity, the
relative performance of all prototype phonemes can be compared
and evaluated in a decision scheme.
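
The effect of the final scaling can be seen in a direct time-domain sketch. The code below does not use the DFT route of CRSCOR; it computes the running correlation as a plain sum of products, omits the low-magnitude test, and uses hypothetical names and data.

```python
import math
from typing import List

Array = List[List[float]]


def normalize_columns(array: Array) -> Array:
    """Scale every frame to unit magnitude (frames must be nonzero)."""
    out = []
    for frame in array:
        magnitude = math.sqrt(sum(x * x for x in frame))
        out.append([x / magnitude for x in frame])
    return out


def running_correlation(prototype: Array, sentence: Array) -> List[float]:
    """Correlate a prototype with every position of a sentence.

    The prototype is column and unit normalized, the sentence is column
    normalized, and each result is divided by sqrt(N).
    """
    n = len(prototype)
    root_n = math.sqrt(n)
    proto = [[x / root_n for x in f] for f in normalize_columns(prototype)]
    sent = normalize_columns(sentence)
    values = []
    for t in range(len(sent) - n + 1):
        raw = sum(p * s
                  for pf, sf in zip(proto, sent[t:t + n])
                  for p, s in zip(pf, sf))
        values.append(raw / root_n)
    return values


# Hypothetical prototype (N = 3) embedded, three times louder, at t = 2.
prototype = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
sentence = ([[0.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
            + [[3.0 * x for x in f] for f in prototype]
            + [[1.0, 0.0, 1.0]])
values = running_correlation(prototype, sentence)
```

At the position of the embedded replica the raw correlation is the square root of N, so the scaled value is one regardless of how loudly the replica was spoken; every other position gives a smaller value.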

Data  Storage 

Following the completion of the correlation computations,
the results were stored in a permanent file in the form of an
M x R array where M is the length of the particular sentence
sample and R is the number of phonemes contained in the
prototype phoneme set. In this form, each element in the
correlation array represents the correlation of a particular
prototype phoneme with the sentence sample at a particular
instant of time. The structure of the array is illustrated
in Figure 9. In addition, this mode of storage provided
optimum flexibility for exercising the decision scheme during
its development.

Figure 9. Correlation Array

Correlation  Plot  Output 


Insight  into  the  development  of  a decision  scheme  was 
enhanced  from  the  analysis  of  the  results  of  the  correlation 
routine.  A plotting  routine  was  designed  that  permitted  the 
selection  of  a prototype  phoneme's  correlation  values  to  be 
sent  to  the  Calcomp  plotter  for  processing.  This  routine 
graphically  depicts  the  running  correlation  of  a particular 
sentence  or  word  group. 

Figure 10 shows the output of the "B" prototype as it
was correlated with the words: "Bench, Bitter, and Bite".
The word group started at time interval 20, and the "B"
phoneme correlated with the beginning "B" of each word. The
program that plots the correlation values was called
FPLOT and its listing appears in Appendix B.


VI.  Decision  Scheme 


The purpose of implementing a decision scheme was to
locate the areas in a given sentence correlation array where
a particular phoneme might have occurred. The organization
of the correlation array facilitated a visual comparison of
all prototype phoneme correlations by comparing their
magnitudes in each row of the array. In addition, the dynamic
performance of each prototype phoneme within a given
sentence was readily analyzed from the correlation plot output.

As a result, the final decision scheme tested the correlation
array values against three criteria to arrive at a phoneme's
location and identification. The three criteria used were:
threshold, rate-of-change of correlation values, and
endurance. They are defined as follows:

Threshold 

The correlation array was first processed for magnitudes
which were greater than or equal to a selected threshold.
Amplitudes satisfying the threshold criterion were left
unchanged; all others were set equal to zero. This threshold
operation can be visualized by drawing a horizontal line on
a given correlation plot as shown in Figure 11.

Thus, an appreciation for when a phoneme occurred can
be gained by observing the peaks that lie above the threshold
level. For this research, a threshold level of 0.6 was used.
This threshold level was experimentally determined to be
just above the average correlation value level for all the
speech data processed. In addition, this level permitted the
largest number of prototype phoneme correlation values to be
selected for testing by the other criteria in the decision
scheme.

Figure 11. Threshold Criteria Illustration
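
The threshold operation amounts to a one-line filter. The sketch below is illustrative only: the function name `apply_threshold` is hypothetical and the correlation values are example data for a single prototype.

```python
from typing import List


def apply_threshold(values: List[float], threshold: float = 0.6) -> List[float]:
    """Keep values at or above the threshold; set all others to zero."""
    return [v if v >= threshold else 0.0 for v in values]


# Illustrative correlation values for one prototype over nine time steps.
proto_1 = [0.5, 0.7, 0.8, 0.9, 0.9, 0.8, 0.7, 0.5, 0.4]
after_threshold = apply_threshold(proto_1)
```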

Rate-of-Change  of  Correlation  Values 

An analysis of the correlation array data indicated that
if a valid phoneme was located in a given sentence, the
correlation magnitudes above the threshold level did not change
dramatically with respect to time. On the other hand, when
an invalid phoneme had correlation values above the threshold
level, the rate-of-change between correlation values as time
increased was more dramatic. Thus, for this research,
correlation values with magnitudes above the threshold level
were left unchanged if each preceding and successive
correlation value was within 61 percent of the value being
tested. Otherwise, those correlation values above the
threshold level not satisfying the above rate-of-change
criterion were set to zero. The 61 percent rate-of-change
value was experimentally determined such that it deleted
only isolated correlation values that were surrounded by
zeroes, but would preserve groupings of two or more
correlation values.
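
One possible reading of the rate-of-change test is sketched below. The text leaves the exact comparison open, so the sketch makes an explicit assumption: a value is kept when at least one neighbor lies within 61 percent of it, which reproduces the stated behavior of deleting isolated values while preserving groupings of two or more. The function name and data are hypothetical.

```python
from typing import List


def apply_rate_of_change(values: List[float], limit: float = 0.61) -> List[float]:
    """Zero any thresholded value that has no neighbor within `limit`.

    Assumption: a neighbor is "within 61 percent" when it differs from
    the tested value by no more than limit * value. A neighbor that was
    zeroed by the threshold stage can never satisfy this.
    """
    def close(neighbor: float, value: float) -> bool:
        return abs(neighbor - value) <= limit * value

    out = []
    for i, value in enumerate(values):
        if value == 0.0:
            out.append(0.0)
            continue
        previous = values[i - 1] if i > 0 else 0.0
        following = values[i + 1] if i + 1 < len(values) else 0.0
        keep = close(previous, value) or close(following, value)
        out.append(value if keep else 0.0)
    return out


# Illustrative thresholded values: one isolated hit and one group of four.
thresholded = [0.0, 0.7, 0.0, 0.0, 0.0, 0.8, 0.9, 0.9, 0.8]
after_rate = apply_rate_of_change(thresholded)
```

The comparisons read the input list rather than the partially processed output, so removing one value never changes the decision for its neighbors.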

Endurance 

To further limit the opportunity for false hits, a
time endurance criterion was incorporated into the decision
scheme to eliminate the momentary correlation values that
rise above the threshold level and satisfy the rate-of-change
criterion. The endurance criterion was implemented by
scanning the correlation array from beginning to end for a
particular phoneme. When a correlation value above the
threshold level was detected, a marker was set. When the
correlation value fell below the threshold level, another
marker was set. Then, the time increment between the two
markers was compared to some specified percentage of the
length of the prototype phoneme, for example, one-half its
length. If the time increment of the "hit" was less than
the "adjusted" length of the prototype phoneme, that portion
of the correlation array between the markers was set to zero.

Thus, following the endurance processing, the
correlation array for a particular prototype phoneme with a given
sentence consisted of those values above a desired threshold
level which did not change from value to value by more than
61 percent, and which stayed above this threshold for an
amount of time dependent on the prototype phoneme's length.
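
The marker-based endurance scan can be sketched as follows. The function name `apply_endurance` is hypothetical, the one-half fraction is the example value mentioned in the text, and a run that is still above threshold at the end of the array is closed there, which is an assumption.

```python
from typing import List, Optional


def apply_endurance(values: List[float], prototype_length: int,
                    fraction: float = 0.5) -> List[float]:
    """Zero every run of nonzero values shorter than the adjusted length."""
    required = fraction * prototype_length   # the "adjusted" length
    out = list(values)
    start: Optional[int] = None              # first marker
    for i in range(len(values) + 1):
        above = i < len(values) and values[i] > 0.0
        if above and start is None:
            start = i                        # a hit begins here
        elif not above and start is not None:
            if i - start < required:         # second marker: hit too short
                for j in range(start, i):
                    out[j] = 0.0
            start = None
    return out


# Illustrative values: a two-step hit followed by a four-step hit.
hits = [0.0, 0.7, 0.8, 0.0, 0.9, 0.9, 0.8, 0.7, 0.0]
long_proto = apply_endurance(hits, prototype_length=6)   # needs 3 steps
short_proto = apply_endurance(hits, prototype_length=4)  # needs 2 steps
```

The same hit can therefore survive for a short prototype and be removed for a long one, which is the length dependence described above.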

Ranking 

Following the threshold, rate-of-change, and endurance
processing, the correlation array was ready for the final
decision output. It might seem logical to merely select the
largest correlation value at each time increment and use the
corresponding prototype phoneme as the final selection.
However, previous research efforts cite the conclusion that
nominally correct phonemes are not always produced with a
high frequency of occurrence in normal speech and that
higher order decision schemes are used by the brain to
determine the actual word content of a sentence (Ref 11, 13,
19). As a result, it was decided to implement a ranking
algorithm which would simply print up to eight phoneme
selections for each time increment by listing the selections
from the highest to lowest correlation values. The
advantages to this output format are:

1.  It  could  be  stored  and  used  by  additional  decision 
schemes  that  take  into  account  higher  order  levels 
of  word  structure  such  as  syntax,  grammar,  and 
context. 

2.  The  relative  performance  of  each  of  the  prototype 
phonemes  could  be  readily  analyzed.  Invalid 
decisions  could  be  noted  so  as  to  determine  methods 
which  would  yield  more  correct  results. 

This  was  the  final  computer  processing  stage  in  this 
research.  An  illustration  of  the  overall  decision  processing 
scheme  is  shown  in  Figure  12.  The  program  which  performs 
this  decision  scheme  is  called  DECIS  and  its  listing  appears 
in  Appendix  B. 
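
The ranking output can be sketched in a few lines of Python. This is not the DECIS listing: the function name `rank_phonemes`, the prototype labels, and the small M x R array are hypothetical, and ties are broken by column order, which the text does not specify.

```python
from typing import List, Sequence


def rank_phonemes(correlation: Sequence[Sequence[float]],
                  names: Sequence[str],
                  max_choices: int = 8) -> List[List[str]]:
    """For each time increment, list surviving phonemes by descending value.

    `correlation` is an M x R array (rows are time increments, columns
    are prototypes) that has already passed the three criteria.
    """
    ranked = []
    for row in correlation:
        hits = [(value, name) for value, name in zip(row, names) if value > 0.0]
        hits.sort(key=lambda hit: hit[0], reverse=True)  # stable for ties
        ranked.append([name for _, name in hits[:max_choices]])
    return ranked


# Hypothetical processed correlation array with R = 3 prototypes.
names = ["Proto 1", "Proto 2", "Proto 3"]
processed = [
    [0.0, 0.0, 0.8],
    [0.7, 0.0, 0.9],
    [0.8, 0.0, 0.6],
    [0.9, 0.0, 0.0],
    [0.8, 0.8, 0.0],
    [0.0, 0.9, 0.0],
    [0.0, 0.0, 0.0],
]
output = rank_phonemes(processed, names)
```

A time increment with no surviving values yields an empty list, leaving later stages free to treat it as a gap between phonemes.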


[Figure 12: illustration of the overall decision processing
scheme. The figure traces three example prototypes, Proto 1
(length 4), Proto 2 (length 6), and Proto 3 (length 5), through
the Threshold Process (threshold level = 0.6), the
Rate-of-Change Process (61 percent rate-of-change), the
Endurance Process (endurance = 0.5 x prototype length), and the
Ranking Process, ending with the Phonemic Output.

Correlation values entering the Threshold Process:

Time                  2   3   4   5   6   7   8   9  10

Proto 1 (Length 4)   .5  .7  .8  .9  .9  .8  .7  .5  .4
Proto 2 (Length 6)   .4  .7  .3  .5  .4  .8  .9  .9  .8
Proto 3 (Length 5)   .8  .9  .6  .5  .5  .4  .5  .4  .3]


VII.  Results 


The  results  obtained  will  be  presented  in  three  parts. 
The  first  part  consists  of  the  results  of  the  prototype 
phonemes  correlated  with  the  authors'  word  groups  used  to 
select the prototype phonemes. The second part presents the
results of the prototype phonemes correlated with the
verification sentences spoken by the authors. The final part
contains the results of the prototype phonemes correlated with
sentences spoken by three different speakers.

Scoring  Philosophy 

As  previously  discussed,  the  decision  scheme  produced  a 
ranking  of  the  eight  correlation  values  in  descending  order 
for  each  speech  time  segment.  The  scoring  was  accomplished 
by  noting  the  relative  position  of  the  phoneme  in  the  word 
being  analyzed  and  its  ranking  in  the  decision  program's 
output.  If  the  phoneme  was  in  the  correct  position  within 
the  word  and  was  ranked  within  the  first  three  choices  in 
the  decision  program's  listing,  the  phoneme  was  considered 
to  be  "located".  If  the  phoneme  was  in  the  correct  position 
and  was  the  first  choice  in  the  decision  program's  listing, 
the  phoneme  was  considered  to  be  "identified".  Otherwise, 
the  phoneme  was  considered  to  be  missed. 
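
The scoring rule can be expressed as a small function. The sketch assumes the position check has already been made by the scorer and looks only at the ranking for that time segment; the names `score_phoneme` and `rates` and the phoneme labels in the example are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def score_phoneme(ranking: Sequence[str], target: str) -> str:
    """Classify one correctly positioned phoneme from its ranking.

    "identified": the target is the first choice.
    "located":    the target is among the first three choices.
    "missed":     anything else.
    """
    if ranking and ranking[0] == target:
        return "identified"
    if target in ranking[:3]:
        return "located"
    return "missed"


def rates(cases: List[Tuple[Sequence[str], str]]) -> Dict[str, float]:
    """Percent identified and percent located over all scored phonemes.

    An identified phoneme is also within the first three choices, so it
    counts toward the location rate as well.
    """
    labels = [score_phoneme(ranking, target) for ranking, target in cases]
    identified = labels.count("identified")
    located = identified + labels.count("located")
    total = len(labels)
    return {"identified": 100.0 * identified / total,
            "located": 100.0 * located / total}


# Hypothetical rankings paired with the phoneme actually spoken.
cases = [
    (["B", "D", "E"], "B"),        # first choice
    (["D", "E", "B", "A"], "B"),   # third choice
    (["D", "E", "A", "B"], "B"),   # fourth choice
    ([], "B"),                     # nothing survived the criteria
]
summary = rates(cases)
```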

Only  the  eight  phonemes  under  study  were  scored.  If  a 
word  contained  other  phoneme  sounds,  they  were  not  scored. 

For instance, in the word "the" only the "Auh" sound was
scored; the "TH" sound was ignored since none of the eight
phonemes being tested resembled it. The various symbols
used in the scoring process are listed in Table XIX in
Appendix C. The scoring charts for the words and sentences
are presented in Tables XX thru XL in Appendix C.

Analysis  and  Calculations 

Evaluation  of  the  results  obtained  for  various  types  of 
speech  was  based  on  the  percent  correct  value,  Pc;  this  same 
criterion  was  used  by  prior  researchers  (Ref  11,  13,  19): 

Pc  = (A/B)  x 100% 

where:

     A is the quantity of correct phonemes
       "identified" or "located"

     B is the total number of phoneme patterns
       considered

Thus, Pc is the percentage of the phoneme patterns correctly
"identified" or "located".

The  binomial  distribution  was  used  in  developing  the 
error  rate  calculations  for  the  sentence  phoneme  analysis. 
The use of this distribution rather than some other
distribution was determined by the manner in which the phonemes
were  located.  Since  the  sentence  phonemes  were  either 
"located"  or  "missed",  the  binary  values  one  and  zero  can 
be used to represent these events. This binary
representation of the sentence phoneme location implies a binomial
distribution for the error rate.

But as the number of events is increased such that B approaches
infinity, the error rate, by definition, is:

$$ P_E = \lim_{B \to \infty} \frac{F}{B} $$

where F is the number of misclassified events. It was assumed
that the number of random events B was large enough so that
the estimated error rate, \hat{P}_E = F/B, equals P_E.

If the error rate calculated for F misclassified events
out of B possible random events is P_E, then F has the
binomial distribution (Ref 5:74):

$$ P(F) = \binom{B}{F} P_E^{F} (1 - P_E)^{B-F} $$

Thus the expected value of the error rate is

$$ E\{F/B\} = P_E $$

The variance of F for B random events is

$$ VAR\{F\} = B P_E (1 - P_E) $$

Thus the variance of the error rate P_E is

$$ VAR\{P_E\} = \frac{P_E (1 - P_E)}{B} $$


The 95 percent confidence intervals for error-rate
estimates for the binomial distribution were determined from
Figure 3.6 presented in the text written by Duda and Hart
(Ref 5:75). Confidence intervals give statistical bounds
on an estimated quantity. Thus, for this research the
95 percent confidence interval is the range of error rates
that, for a given sample size, is expected to contain the
true error rate 95 percent of the time.
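
The percent correct, error rate, and variance calculations can be sketched as below. The counts are hypothetical, the function name `error_rate_summary` is an assumption, and the interval uses the normal approximation to the binomial as a stand-in for reading the Duda and Hart chart, so its bounds will not match the chart values exactly.

```python
import math
from typing import Dict, Tuple, Union

Summary = Dict[str, Union[float, Tuple[float, float]]]


def error_rate_summary(correct: int, total: int, z: float = 1.96) -> Summary:
    """Percent correct, error rate, its variance, and an approximate interval.

    The interval is error_rate +/- z * sqrt(variance), clipped to [0, 1];
    z = 1.96 corresponds to about 95 percent under a normal approximation.
    """
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    error_rate = (total - correct) / total             # F / B
    variance = error_rate * (1.0 - error_rate) / total
    half_width = z * math.sqrt(variance)
    return {
        "percent_correct": 100.0 * correct / total,    # Pc = (A/B) x 100%
        "error_rate": error_rate,
        "variance": variance,
        "interval": (max(0.0, error_rate - half_width),
                     min(1.0, error_rate + half_width)),
    }


# Hypothetical counts: 90 phonemes located out of 100 scored.
summary = error_rate_summary(correct=90, total=100)
```

For small samples or error rates near zero the normal approximation is poor, which is one reason a chart of exact binomial intervals is preferable in practice.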

The  summarized  results  for  each  of  the  three  data  groups 
are  presented  by  giving  the  percent  correct  for  each  group  of 
words  or  sentences  for  each  speaker,  and  a combined  score  for 
the  speakers.  The  total  number  of  correct  choices  and  the 
total  number  of  events  for  each  data  group  was  determined  so 
that  the  probability  of  being  correct,  probability  of  being 
in error, variance of the error, and the 95 percent
confidence interval for the probability of error could be
calculated. This information is presented in Tables VIII, X, and
XII.  In  addition,  the  detailed  scoring  of  each  of  the  word 
groups  or  sentences  is  presented  in  Appendix  C. 

Word  Groups 

The results for the eight word groups are summarized in
Table VIII. The percent correct for phoneme location and
identification for each word group is tabulated for author 1,
author 2, and the combination of the two. The first three
columns present the results for the particular phoneme that
was calculated from each of the word groups. For example,
the B phoneme was scored with the B words. The last three
columns are the results when all eight phonemes were scored
with the words. The data used to make these calculations can
be found in the corresponding tables in Appendix C.

Table VIII

Analysis of the Word Groups

As  can  be  seen  from  Table  VIII,  the  net  probability  of 
locating  a particular  phoneme  for  both  authors  is  0.983.  The 
95  percent  confidence  interval  for  the  probability  of  error 
is  0.009  to  0.03.  Thus,  one  can  be  95  percent  confident  that 
for  247  trials  the  desired  phoneme  will  be  located  at  least 
97 percent of the time. There was an 88.2 percent
identification rate for these same trials, which produced a 95
percent confidence error interval of 0.06 to 0.17.

When  all  phonemes  were  scored,  there  were  444  trials. 
This  yielded  a 95.9  percent  location  rate  that  corresponds 
to  a 95  percent  confidence  error  interval  of  0.035  to  0.10. 
Finally,  the  identification  rate  was  83.5  percent  which 
yielded  a 95  percent  confidence  error  interval  of  0.13  to 
0.20. 

Verification  Sentences 

The sentences used for verification of the phoneme set
are listed in Table IX. The results for all of the located
and identified phonemes are presented in Table X. As can be
seen from Table X, the net probability of locating all the
phonemes scored for discrete speech is 0.911 and the
corresponding 95 percent confidence error interval is 0.0 to 0.06.
For the 68 events scored, the net probability of identifying
the phonemes was 0.779 and the corresponding 95 percent
confidence error interval is 0.13 to 0.35. The continuous
speech had a slightly lower net probability for location
of 0.882 and the corresponding 95 percent confidence error
interval was 0.055 to 0.21. Finally, the net probability of
identifying the phonemes in continuous speech was 0.661 and
the corresponding 95 percent confidence error interval was
0.23 to 0.46.

Table IX

Verification Sentences

Sentence
Number      Sentence

1.          Abraham drafted a note.
2.          See me wave at my associate.
3.          A boy got out the back gate.
4.          Joe was seen around the airplane.

Test  Sentences 

The  phonemes  were  further  evaluated  by  scoring  them 
with  the  sentences  listed  in  Table  XI  that  were  spoken  by 
three different speakers. Due to a preprocessing error,
either in the recording equipment or the digitizing
equipment, noise was introduced into the sentences. Thus, only
the sentences presented in Table XI were scored. Further,
as  can  be  seen  from  Table  XII,  not  all  speakers  for  these 
sentences  were  scored  due  to  the  noise  problem.  However, 
the  sample  size  is  still  appreciable  even  with  these  losses. 

As  can  be  seen  from  Table  XII,  the  net  probability  of 
locating  all  the  phonemes  scored  for  discrete  speech  is  0.779 
and  the  corresponding  95  percent  confidence  error  interval  is 
0.17  to  0.26.  For  the  377  events  scored,  the  net  probability 
of  identifying  the  phonemes  was  0.623  and  the  corresponding 
95  percent  confidence  error  interval  is  0.33  to  0.42.  The 
continuous  speech  had  a slightly  lower  net  probability  for 
location  of  0.661  and  the  corresponding  95  percent  confidence 
error  interval  was  0.29  to  0.40.  Finally,  the  net  probability 


Table XI

Test Sentence Group

Sentence
Number     Sentence

1.         Abraham drafted a note.

2.         No note to terminate the leave of the
           American called Caruso was drafted this
           day.

3.         The bright bulb formed a ray that made
           a trace of the rubber rat.

4.         From the boat docked in the bay, we saw
           the rhino, leech, and toad as they lay
           dead along the tide.

5.         Before the trip, the rabbit rested along
           the open field of the rancher.

6.         Does Dennis teach reading, or does
           Dennis teach driving?

7.         Joe was seen around the airplane.

8.         Take a closer look at Eastman Kodak's
           bubbling reagents for photo-resist
           stripping.

9.         Each person at Beckman sees his
           responsibility aimed toward fabricating
           better resistors, displays, and drugs.



Table XII (continued)

For instance, in the word
scored, the "TH" sound was ignored since none of the eight

VIII. Hardware Modelling Analysis

The  purpose  of  the  hardware  modelling  analysis  was  to 
investigate  the  feasibility  of  implementing  the  speech 
recognition  scheme  with  available  semiconductor  technology. 
Since  the  speech  recognition  scheme  is  computationally 
oriented,  it  was  decided  to  base  this  hardware  modelling  study 
around  a microprocessor.  Once  the  preliminary  design  was 
completed,  memory  technologies  and  sizes  were  selected  along 
with  compatible  peripheral  support  elements,  and  then  an 
overall  time-delay  analysis  was  accomplished.  It  was  from 
the  time-delay  analysis  that  the  limitations  of  the  hardware 
model were recognized. Finally, projections of future
semiconductor technologies were applied to reconcile the
limitations of the hardware model and give it near
real-time capability.

Microprocessor  Selection 

The general requirement for a microprocessor to have
writable control-store and multitasking software was recognized
as a prerequisite for implementing a signal processing
scheme such as the speech recognition routine developed in
this thesis. These characteristics permit the scheduling of
several programs in main memory at once for execution either
simultaneously or at different times. A microprocessor
possessing these characteristics and selected for this hardware
design was the Texas Instruments TMS 9900. Several of
the important features of this microprocessor are listed in
Table XIII (Ref 27).

Microprogramming  and  Hardwire  Multiply 

The  conventional  division  of  functions  between  hardware 
and  software  in  a computer  system  severely  limits  signal 
processing  calculations  to  data  rates  of  a few  kilohertz  in 
real time. But the speed of the complex computations associated
with signal processing can be increased well into the
megahertz range if microprogramming and hardwire multiplication
are implemented. The advantage in signal processing
time using microprogramming and hardwire multiply to calculate
the Cooley-Tukey DFT is shown in Table XIV (Ref 17).

Speech  Recognition  Flow  Chart 

The  first  step  taken  to  model  the  speech  recognition 
scheme  was  to  construct  a flow  chart  linking  the  vital 
processing  computations.  The  flow  chart  used  for  this 
hardware  modelling  analysis  is  shown  in  Figure  13. 

All  limitations  in  the  programs  and  processing  of  the 
speech  recognition  scheme  were  included  in  the  flow  chart 
and  subsequent  hardware  model.  For  example,  all  input  speech 
data  was  low-pass  filtered  to  5 kHz  and  sampled  at  a 10  kHz 
rate in the analog-to-digital conversion process. These
limitations influenced both the selection of specific peripherals
to complement the TMS 9900 microprocessor and the
all-important time-delay analysis. The overall control
of the sequence of operations depicted in the flow chart
would be accomplished by an executive program in the TMS
9900 microprocessor.

Figure 13. Speech Recognition Flow Chart (Plate 1)

Figure 13. Speech Recognition Flow Chart (Plate 3)

Hardware  Implementation 

The  flow  chart  served  as  the  basis  for  developing  a 
hardware  version  of  the  speech  recognition  process.  The 
system design is shown in Figure 14. The specific peripherals
selected are listed in Table XV (Ref 3, 6, 26, 27, 29, 32).
Each RAM buffer was sized according to the following
division calculation:

\[
\text{Number of RAM chips needed} = \frac{\text{Number of 16-bit words to be stored}}{\text{Number of 16-bit words per chip}}
\]


Similarly, each ROM program buffer was sized according to the
following multiplication and division calculations:

\[
\text{Number of 16-bit microcode statements} = (\text{Number of executable Fortran statements}) \times (\text{5 microcode statements per executable Fortran statement})
\]

\[
\text{Number of ROM chips needed} = \frac{\text{Number of 16-bit microcode statements}}{\text{Number of 16-bit words per chip}}
\]
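
The two sizing rules above can be sketched as follows. The function names and the numeric inputs are hypothetical (they are not taken from Table XV), and rounding up to a whole chip is an assumption, since the text only gives the division.

```python
import math

def ram_chips_needed(words_to_store, words_per_chip):
    """16-bit words to store divided by 16-bit words per chip, rounded up (assumption)."""
    return math.ceil(words_to_store / words_per_chip)

def rom_chips_needed(fortran_statements, words_per_chip, microcode_per_statement=5):
    """Apply the 5-microcode-statements-per-Fortran-statement rule, then size the ROM."""
    microcode_words = fortran_statements * microcode_per_statement
    return microcode_words, math.ceil(microcode_words / words_per_chip)

# Hypothetical example: 700 vectors of 16 words each, stored in 4096-word RAM chips.
print(ram_chips_needed(700 * 16, 4096))
# Hypothetical example: a 300-statement Fortran program stored in 1024-word ROM chips.
print(rom_chips_needed(300, 1024))
```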


Figure 14. Speech Recognition System Design

Peripherals Selected to Implement the Hardware Model

The execution time to either load or read a RAM buffer
was calculated using the following formula (Ref 26):

\[
T = t_c + M \, t_A
\]

where

T = Total execution time

t_c = Microprocessor access time

M = Number of memory accesses

t_A = Maximum memory access time
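
As a worked illustration of the buffer timing formula, the sketch below evaluates T for made-up values; the 1 microsecond processor access time and 500 nanosecond memory access time are assumptions chosen only to show the arithmetic, not figures from Table XV.

```python
def buffer_access_time(t_c, accesses, t_a):
    """T = t_c + M * t_A, with all times in seconds."""
    return t_c + accesses * t_a

# Hypothetical values: 700 frequency vectors of 16 words each.
total = buffer_access_time(t_c=1e-6, accesses=700 * 16, t_a=500e-9)
print(f"{total * 1e3:.3f} ms")
```

Because M is large, the memory access term dominates, which is consistent with the later observation that a faster memory technology gives the largest reduction in delay.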

Similarly, the execution time for each ROM program was calculated
by extrapolating the results documented for the microcoded
128-point DFT program (Ref 17). Specifically, the
extrapolation was accomplished using the following division
and multiplication calculations:

\[
\text{Average execution time per microcoded statement} = \frac{\text{Time to execute microcoded DFT}}{\text{Number of executable microcoded instructions in the DFT}}
\]

\[
\text{Total execution time per microcoded ROM program} = (\text{Number of microcoded instructions per ROM program}) \times (\text{Average execution time per microcoded statement})
\]
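
The extrapolation can be sketched in a few lines. The DFT time and instruction counts below are placeholders, not the values documented in Ref 17.

```python
def rom_program_time(dft_time, dft_instructions, program_instructions):
    """Scale the measured DFT time to another microcoded program by instruction count."""
    average = dft_time / dft_instructions
    return average, program_instructions * average

# Hypothetical inputs: a 10 ms DFT of 2000 microcoded instructions,
# extrapolated to a 1500-instruction ROM program.
average, total = rom_program_time(10e-3, 2000, 1500)
print(average, total)
```

This assumes every microcoded statement costs the same average time as those in the DFT, which is the simplification the extrapolation itself makes.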


Finally, the time to execute the hardware multiplications was
estimated from:

\[
T_T = T_{MA} \, H
\]

where

T_T = Total time for hardware multiplications

T_MA = Time required for each multiplication and accumulation

H = Number of calculations
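
A minimal numeric sketch of the multiplication estimate follows. The 200 nanosecond multiply-accumulate time and the operation count are assumptions for illustration only.

```python
def hardware_multiply_time(t_ma, calculations):
    """T_T = T_MA * H, with T_MA in seconds."""
    return t_ma * calculations

# Hypothetical: one multiply-accumulate per component when a 40-vector prototype
# of 16 components is slid across 700 sentence positions.
h = 700 * 40 * 16
print(hardware_multiply_time(200e-9, h))
```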


It  is  noted  here  that  the  memory  sizes  and  time  of 
execution  calculations  were  not  done  for  program  LING  or  the 
RAM  buffer  that  holds  the  results  of  LING.  This  particular 
program was not a part of this thesis, but would be necessary

for operations 3, 7, 8, and 9 of 3.66 seconds. Translating
this advantage to the entire system means that the time
delay to process an entire 8.9 second utterance for one phoneme
will be 9.24 seconds and for 100 phonemes, it will be
12.87 seconds. Additionally, new 16-bit microprocessors are
being developed, such as the Intel 8086, which has an operating
speed of 8 MHz compared to 3 MHz for the Texas Instruments
TMS 9900 (Ref 14). The application of this new and faster
microprocessor  would  also  contribute  toward  the  objective  of 
recognizing  speech  in  near  real-time.  In  summary,  digital 
processing  components  either  currently  available  or  projected 
in  the  near  future  will  support  a near  real-time  realization 
of  what  has  been  to  date  exercised  as  an  offline,  non  real-time 
speech recognition process.

IX. Conclusions


This  research  effort  had  three  primary  goals: 

1.  Develop  a method  to  select  and  calculate  universal 
prototype  phonemes  so  as  to  improve  the  performance  of  the 
existing  method  of  multiple-speaker  and  continuous  speech 
recognition. 

2.  Modify  the  existing  speech  recognition  programs  so 
that  their  outputs  can  be  readily  analyzed  and  used  as  the 
input  to  a higher-order  syntactic  decision  scheme. 

3.  Model  the  speech  recognition  scheme  developed  in 
this  research  with  existing  and/or  projected  solid-state 
technology  to  permit  the  speech  recognition  process  to  be 
done  in  near  real-time. 

The  results  obtained  in  this  research  indicate  that 
these  goals  have  been  accomplished. 

The  results  obtained  for  the  word  groups  indicate  a 
minimum  95  percent  confidence  level  of  0.90  for  the  location 
of  any  phoneme  and  a 95  percent  confidence  level  of  0.80  for 
the  identification  of  any  phoneme.  These  figures  demonstrate 
that the phonemes could be located and identified satisfactorily
for the words the phonemes were taken from. They
also show that averaging the phonemes from the two speakers
lowers the results below those of a perfect autocorrelation,
but the values remain high enough to warrant averaging the
phonemes to develop a universal phoneme. Since the scores were not
perfect, this may indicate that averaging accentuates the
slight differences in speech patterns between speakers,
such as dialect or regional speech characteristics. Under
this assumption, a separate prototype set may have to be developed
for different regions and dialects.

The  verification  sentences  show  a slight  decrease  in 
the  scores  compared  to  the  word  groups.  For  discrete  speech, 
the  location  score  was  91.1  percent.  The  identification 
score was 77.9 percent. Due to the way the previous researchers
determined their location and identification scores,
a direct comparison with their results was not possible.

However, one set of data from the Guyote and Sisson thesis
could be compared (Ref 11). They used two speakers and their
phonemes  were  produced  by  averaging  two  phoneme  samples  to 
yield  an  averaged  prototype  phoneme.  In  contrast,  the  method 
used  in  this  thesis  averaged  28  sample  phonemes  from  two 
speakers  to  yield  a prototype  phoneme.  The  Guyote  and  Sisson 
score  for  location  was  97  percent  and  for  identification  it 
was  78.6  percent.  Since  they  only  scored  two  sentences  for 
each  speaker,  a detailed  comparison  between  the  scores  is 
not  warranted.  However,  a very  simple  comparison  of  the  raw 
scores  shows  them  to  be  comparable. 

The continuous speech shows a further drop in the scores
with 88.2 percent for location and 66.1 percent for identification.
Only one sentence was scored in the Guyote and
Sisson thesis, producing 83.3 percent for both location and
identification (Ref 11). The results still show some consistency
between the two sets of scores, with the present results
having higher location but lower identification.

The  test  sentences  show  a further  decrease  in  the  scores 
obtained.  The  location  of  phonemes  was  66.1  percent  and  the 
identification  was  48.6  percent.  There  are  no  scores  from 
the  previous  research  efforts  that  can  be  used  as  a basis  for 
comparison. The significant drop in the scores seems to indicate
that the phonemes are not representative of those of
other speakers.

The differences in the scores between the discrete and continuous
speech may be explained by the fact that continuous speech merges
phonemes to such an extent that the phonemes at the beginning
of some words are smeared together and may be overshadowed by
the phoneme at the end of another word. Also, the phonemes
of some words may be missed entirely but the brain can still
comprehend the word. For example, in Sentence 3 of the test
sentences, the words "The Rubber" can be analyzed. In the
discrete sentence, the E sound in "The" is identified while
the R in "Rubber" is located. But when the sentence is spoken
continuously, the R is missed entirely and is engulfed by the
E sound in "The." Even so, the missing R does not impede the
brain from identifying "Rubber" from the context of the sentence.

However, with the present correlation scheme, a smeared
phoneme is displayed as a miss. As a result,
the scores show the drastic reduction noted above. Since the
location scores are still greater than the identification
scores, the phonemes are still there, but
they are not necessarily the first choice in the decision
program, and some type of higher-order decision scheme must
be used to select the correct phoneme. The use of a higher-order
decision scheme to help identify the located phonemes
was also noted by Neyman (Ref 19:71).

Another factor observed in the sentences was
that the vowels seemed to overshadow the leading consonants.
It  is  inferred  from  Fletcher  that  combinations  of  consonants 
and  vowels  significantly  influence  their  frequency  density 
functions  (Ref  10:59).  Since  the  consonant  lengths  were 
generally  shorter  than  the  vowel  lengths,  the  consonant 
could  be  missed  and  the  vowel  could  still  be  identified. 

This  implies  that  the  basic  units  of  speech  should  not  be 
phonemes  but  combined  sounds  such  as  the  consonant-vowel 
sounds  covering  all  possible  combinations  as  presented  in 
Fletcher  (Ref  10:60-61). 

The  speech  recognition  programs  were  modified  to  permit 
a larger  amount  of  data  to  be  processed  per  execution.  This 
was  necessary  because  of  the  large  number  of  calculations 
and  tests  that  had  to  be  done.  These  programs  can  now 
process a sentence of any length. Finally, each of the programs
was fully documented to allow for easy interpretation and
rapid modification.

The section on the modelling of the speech recognition
scheme showed that, with present technology, processing
an 8.9 second segment of speech would take as long as
71.1 seconds before an initial response is received from the
processor. It was shown that by changing the 64K CCD RAMs
to a faster future technology, the delay time could be reduced
to about 13 seconds. Also, by using a proposed faster
16-bit microprocessor, the processing will approach real
time.

X. Recommendations


Two  classes  of  recommendations  for  continuation  of  this 
research  are  listed  below.  Class  I deals  with  methods  for 
phoneme  preparation,  analysis,  and  correlation.  Class  II 
deals  with  other  modifications  which  would  give  the  user 
greater  insight  into  the  correlation  performance. 

Class  I 

1.  Investigate  the  performance  of  the  recognition 
scheme  by  using  a larger  phoneme  set  where  the  phonemes  are 
consonant-vowel combinations. A recommended set would be
the consonant-vowel combinations listed in Fletcher (Ref
10:60-61).

2.  A larger  population  of  speakers  and  sample  phonemes 
should  be  used  to  calculate  the  averaged  prototype  phonemes. 
At  least  10  speakers  should  be  used.  For  each  desired 
phoneme  sound,  each  speaker  should  say  at  least  12  sample 
words.  This  should  produce  a more  universal  set  of  averaged 
prototype  phonemes. 

3.  The  speakers  used  to  form  the  phoneme  set  should  be 
from  a common  region  and  have  similar  dialects.  Otherwise, 
the  phonemes  produced  might  accentuate  the  differences  in 
the  speech  patterns  and  reduce  the  correlation  values. 

Class  II 

1. Have the Analog/Hybrid Systems Branch reconstruct
analog speech from the digitized L-tapes. This should be
done after the initial digitization (64 channels) and after
the logarithmic compression (16 channels). This will verify
that the speech has not been significantly altered from
normal speech and that noise has not been introduced into
the speech.

2.  Modify  the  correlation  routine  to  accept  32-component 
frequency  vectors.  Compare  the  results  of  a 32-component 
frequency  vector  correlation  with  a 16-component  frequency 
vector  correlation.  The  added  information  contained  in  the 
32-component  frequency  vector  calculation  may  warrant  a 
permanent modification to the correlation program.

A. Data Processing Charts and Notes

This  appendix  contains  the  flow  charts  of  the  seven 
programs  used  in  the  speech  recognition  process.  These  flow 
charts  outline  the  operation  of  each  program. 

Also included are notes which clarify the important
operating points of each program. Table XVIII lists the
seven programs along with their associated inputs and outputs.


Table XVIII

Data Processing Programs

Name             Input               Output

EK1 (OCTAVE1)    L-tape #1           L-tape #2
                                     Spectrogram
EK2 (OCTAVE2)    L-tape #2           Normalized Spectrogram
EK3 (PUNCH)      L-tape #2           Punched Cards #1
                                     (Target Phonemes)
EK4 (PROAVE)     Punched Cards #1    Punched Cards #2
                                     (Prototype Phonemes)
EK5 (CRSCOR)     L-tape #2           PF (Correlation arrays)
                 Punched Cards #2    Calcomp Graphs
EK6 (FPLOT)      PF                  Calcomp Graphs
EK7 (DECIS)      PF                  Phonemic Output

Note: PF refers to Permanent Files on the CDC system.

EK1 (OCTAVE1)

EK1 (OCTAVE1) reads the 64-component vectors on
L-tape #1 as input; the tape is assigned the local file name
Tape 1. L-tape #1 is the digitized data produced by the ASD
Computer Center. Each line of data on L-tape #1 consists of
two leading numbers (NCHAN and NDIM) followed by the 64
components of each frequency vector. NCHAN and NDIM are
produced by the subroutine used to calculate the DFT. EK1
reads each line of data but only processes the 64 components.

The program logarithmically compresses the data from
64 components per frequency vector to 16 components per
frequency vector. This process causes the higher frequencies
to be emphasized. The results of this compression are stored
in a local file called Tape 2 and written on L-tape #2 for
use in subsequent programs. EK1 also produces a non-normalized
spectrogram of the speech data.
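
The band edges used by EK1 are not given in this appendix. The sketch below is therefore only one hypothetical way to fold 64 linear channels into 16 bands whose widths grow with frequency; the `BAND_WIDTHS` list and the choice to sum (rather than average) the channels in each band are both assumptions.

```python
# Hypothetical band widths: single channels at low frequency, wide bands at
# high frequency. 8*1 + 4*2 + 2*8 + 2*16 = 64 channels in 16 bands.
BAND_WIDTHS = [1] * 8 + [2] * 4 + [8] * 2 + [16] * 2

def compress_vector(vector64, widths=BAND_WIDTHS):
    """Fold a 64-component frequency vector into len(widths) band sums."""
    if len(vector64) != sum(widths):
        raise ValueError("vector length must equal the total band width")
    compressed, start = [], 0
    for width in widths:
        compressed.append(sum(vector64[start:start + width]))
        start += width
    return compressed

flat_spectrum = [1.0] * 64
print(compress_vector(flat_spectrum))
```

With summation, a flat input spectrum yields larger values in the wide high-frequency bands, which is one possible reading of the remark that the higher frequencies are emphasized.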

The two variables which are important in EK1 are NREC
and NN2. NREC represents the number of files to be read. A
file is defined to be the digitized speech between the 2 kHz
tones on the recording tape. NN2 is set to be one more than
the number of records contained in the largest file. The
number of files and the number of records contained in each
file are available on the printout received from the ASD
Computer Center.




EK2 (OCTAVE2)


The input to EK2 (OCTAVE2) is the compressed data on
L-tape #2 created by EK1. EK2 only produces a normalized
spectrogram of the speech data. This presents the speech
data in a more easily interpreted form than EK1. Although
it is necessary to read the entire sentence record to produce
the spectrogram, the two variables NSTART and NSTOP
allow the user to select portions of the speech file of
interest. The entire file will be read and stored on a
local file called Tape 1. However, only the desired portions
of the speech data (between NSTART and NSTOP) will be output
in the normalized spectrogram. The total number of speech
files to be read is set by NREC.


EK3 (PUNCH)


EK3 (PUNCH) uses the data on L-tape #2 as input. EK3
is used to select the phonemes from the word groups and it
produces a set of punched cards for each target phoneme
(punched cards #1).

During  the  phoneme  selection  process  the  locations  of 
the  target  phonemes  were  noted  by  recording  the  values  of  the 
time  increments  given  on  the  normalized  spectrogram.  The 
length  of  a target  phoneme  was  the  same  within  a word  group. 

The  program  can  only  be  used  to  process  one  word  group 
from  one  speaker  at  a time.  The  beginning  values  of  the 
target  phonemes  were  put  into  the  IBGN  data  statement.  The 
end  values  of  the  target  phonemes  were  put  into  the  IEND  data 
statement.  The  program  reads  in  the  entire  word  group  and 
selects  the  target  phonemes  according  to  the  data  statements. 
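
The selection step can be sketched as a pair of index lists applied to the word-group data. The function name and the sample indices are hypothetical; the only detail taken from the text is that IBGN and IEND hold the beginning and end values of each target phoneme, interpreted here as 1-based inclusive positions in the Fortran style.

```python
def select_targets(word_group, ibgn, iend):
    """Cut target phonemes out of a word group using begin/end time increments."""
    if len(ibgn) != len(iend):
        raise ValueError("IBGN and IEND must have the same number of entries")
    targets = []
    for begin, end in zip(ibgn, iend):
        # 1-based inclusive bounds become a 0-based Python slice.
        targets.append(word_group[begin - 1:end])
    return targets

# Hypothetical word group: 20 time increments of 16-component vectors.
word_group = [[float(t)] * 16 for t in range(1, 21)]
targets = select_targets(word_group, ibgn=[3, 11], iend=[6, 14])
print([len(t) for t in targets])
```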

The  program  generates  14  sets  of  punched  cards  for  each 
word  group.  A printout  of  these  values  is  also  produced. 

The  16  components  for  each  frequency  vector  are  contained  on 
two  punched  cards. 

This  process  must  be  repeated  for  each  speaker's  group 
of  words.  This  results  in  28  sets  of  target  phonemes  for 
each word group, assuming there are two speakers.

EK4 (PROAVE)


EK4 (PROAVE) uses the punched cards of EK3 as input
and averages the 28 target phonemes of a particular word
group to yield an averaged prototype phoneme. EK4 then
produces a set of punched cards for the averaged prototype
phoneme (punched cards #2).

The  program  does  the  averaging  by  summing  all  28  values 
of  a specific  frequency  vector  component  and  then  dividing 
by  28  to  give  an  average  value  for  the  component.  When  all 
16  components  of  a frequency  vector  have  been  averaged 
the  program  produces  a punched  card  output  of  the  vector. 

The variables that control this process are JE, JI, DIV, and
KARD. KARD is the number of lines of input data (half the
number of cards). DIV is the number of target phonemes that
are to be averaged. JI is the length of the target phonemes.
JE is (1 + KARD - JI), which gives the first line of data of
the last target phoneme.
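
A minimal sketch of the averaging, using the variable names described above, is given below. It assumes the input lines are the DIV target phonemes stacked one after another, each JI lines long, so that KARD = DIV x JI; the function name and the tiny test data are hypothetical.

```python
def average_prototype(lines, ji, div):
    """Average DIV stacked target phonemes, each JI lines of 16 components."""
    kard = len(lines)                      # number of lines of input data
    if kard != ji * div:
        raise ValueError("expected KARD = DIV * JI lines of input")
    je = 1 + kard - ji                     # first line (1-based) of the last target
    prototype = []
    for offset in range(ji):
        # Gather the same time position from every target phoneme.
        rows = [lines[start + offset] for start in range(0, je, ji)]
        prototype.append([sum(column) / div for column in zip(*rows)])
    return je, prototype

# Three hypothetical targets, two lines each, with constant values 1, 2 and 3.
lines = [[float(k)] * 16 for k in (1, 1, 2, 2, 3, 3)]
je, prototype = average_prototype(lines, ji=2, div=3)
print(je, prototype[0][:4])
```

In the thesis configuration DIV would be 28, giving one averaged prototype per word group.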

This program must be run for each prototype. The end
product of this process is a set of eight averaged prototype
phonemes.

EK5 (CRSCOR)


This  program  comprises  the  main  body  of  the  research. 

EK5  (CRSCOR)  consists  of  a main  program  (CRSCOR)  which  inputs 
selected  variables  that  are  established  using  the  comments 
in  the  program.  Following  the  initialization  of  the  desired 
variables,  the  main  program  calls  a subroutine  (XCORR)  which 
handles  the  correlation  computations. 

The  sentence  data  is  attached  on  Tape  3 and  is  read  from 
L-tape  #2.  The  SKIPF  control  card  is  used  to  skip  down  to 
the  file  on  L-tape  #2  containing  the  desired  speech  data. 

The  value  put  into  the  SKIPF  card  is  one  less  than  the  number 
of  the  file  desired.  The  prototype  data  is  attached  as  Tape  1 
and  is  read  from  punched  cards  (Punched  Cards  #2)  produced  by 
EK4. Tape 2 is a local file used to store the prototypes for
each  speech  segment  so  they  only  need  to  be  read  in  once  for 
each  block  of  speech. 

The  main  program  is  organized  to  accept  data  in  blocks  of 
700  frequency  vectors.  If  the  data  is  longer  than  this,  the 
utterance  must  be  segmented  and  the  value  of  IRUN  adjusted 
accordingly.  The  block  of  comments  at  the  beginning  of  the 
program  listing  (Appendix  B)  gives  the  statement  numbers  of 
the  variables  that  must  be  changed  for  each  run. 

Subroutine XCORR calculates the correlation values of
each prototype with the sentence and produces an array containing
this information. The printout includes a listing of
the sentence data, prototype values, and correlation computations.
The program also prints information concerning the
subdivisions of the sentence, number of zeros required to
augment the data arrays, and prototype lengths.
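
The exact computation inside XCORR is documented in the program listing rather than in this appendix. As an illustration of the general idea only, the sketch below slides a prototype along the sentence and computes a normalized correlation at each time increment; the normalization and the function names are assumptions, and no zero augmentation is performed.

```python
import math

def flatten(vectors):
    """Concatenate a list of frequency vectors into one flat list."""
    return [value for vector in vectors for value in vector]

def sliding_correlation(sentence, prototype):
    """Normalized correlation of the prototype with each sentence window."""
    length = len(prototype)
    proto = flatten(prototype)
    proto_norm = math.sqrt(sum(x * x for x in proto))
    scores = []
    for t in range(len(sentence) - length + 1):
        window = flatten(sentence[t:t + length])
        window_norm = math.sqrt(sum(x * x for x in window))
        dot = sum(a * b for a, b in zip(proto, window))
        scores.append(dot / (proto_norm * window_norm)
                      if proto_norm and window_norm else 0.0)
    return scores

# Hypothetical 4-component vectors for brevity (the thesis uses 16 components).
prototype = [[1, 0, 2, 1], [0, 3, 1, 0]]
sentence = [[1, 1, 1, 1], [1, 0, 2, 1], [0, 3, 1, 0], [2, 2, 0, 1]]
scores = sliding_correlation(sentence, prototype)
print(scores.index(max(scores)))
```

The peak of the resulting correlation-versus-time curve marks the position where the sentence most resembles the prototype, which is the quantity plotted for each prototype.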

The subroutine XCORR calls the plot routine which produces
a correlation versus time graph for each prototype.
Since the correlation values for up to nine prototypes can
be plotted, the DISPOSE command was used to load several
buffers and permit the Calcomp plotter to output the results
in groups of nine graphs.

Finally,  the  subroutine  XCORR  writes  the  correlation 
values  into  a permanent  file  called  Tape  4.  An  end-of-file 
is  then  placed  after  the  last  correlation  value. 

Each phase of the correlation processing is well documented
to make it easy to follow. This allows rapid
changes or revisions to be made to the program.

EK6 (FPLOT)


EK6 (FPLOT) attaches the PF generated in CRSCOR as
local file Tape 1. It reads selected portions of the correlation
arrays into an array called SAMPLE. These values
are then sent to special graphing routines which have been
attached to the program through the control cards. Following
the graph calls, the resulting data is sent to the Calcomp
Plotter through the use of the CALL PLOTE (N) instruction.

The output can be adjusted to provide the same plot as
was produced with the correlation routine. However, this plot
routine is more versatile and can be used to plot any of the
prototypes at any point in the correlation array by varying
the IBGN1 and IEND1 data statements and NPRO. The labels of
the axes on the graphs must be changed according to the
prototypes being plotted. The other variables that must be
changed for each run are listed in the comments at the beginning
of the program (Appendix B).

EK7 (DECIS)


EK7 (DECIS) attaches the PF generated in CRSCOR as
local file Tape 1 and processes the correlation arrays
according to the methods described in Section VI. The
input arrays which contain information concerning the
phonemes and their lengths must be adjusted for each set
of phonemes. The variables ENDUR, DELTA and THRHLD are
the endurance (time), rate-of-change criterion, and correlation
threshold values, respectively. This program
generates a list of the phonemes identified in a sentence.
The list gives the eight phonemes with the highest correlation
values versus time. It is possible to store these
results in permanent files in order to preserve these
processed arrays for a higher-order decision scheme.
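
The ranking step can be sketched as follows. This illustrates only the correlation threshold (THRHLD) and the ordering of candidates at each time increment; the endurance (ENDUR) and rate-of-change (DELTA) tests of Section VI are not reproduced, and the function name and sample values are hypothetical.

```python
def rank_phonemes(correlations, thrhld, top=8):
    """For each time increment, list phoneme symbols by descending correlation.

    correlations maps a phoneme symbol to its correlation values versus time.
    Only values at or above the threshold are kept.
    """
    steps = min(len(values) for values in correlations.values())
    ranked = []
    for t in range(steps):
        candidates = [(values[t], symbol)
                      for symbol, values in correlations.items()
                      if values[t] >= thrhld]
        candidates.sort(key=lambda item: -item[0])
        ranked.append([symbol for _, symbol in candidates[:top]])
    return ranked

# Hypothetical correlation arrays for three prototypes over three time increments.
correlations = {
    "B": [0.2, 0.9, 0.5],
    "A": [0.7, 0.6, 0.1],
    "E": [0.8, 0.3, 0.65],
}
print(rank_phonemes(correlations, thrhld=0.5))
```

Keeping the full ranked list, rather than only the first choice, is what would let a higher-order decision scheme recover phonemes that were located but not ranked first.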


Figure 21. EK7 (DECIS) Flow Diagram

IF(J.EQ.5) GO TO 701

IF(J.EQ.9) GO TO 701
11 CALL PLOT(3.0,0.0,-3)

JJ=J+NSTART

PRINT 411,JJ,(SYMBOL1(IPHON(J,N)),N=1

411 FORMAT(1X,I4,5X,9(A7,6X))


Table XIX

Scoring Symbol Set

Phoneme                Symbol

Lead B                 B
Lead D                 D
Lead R                 R
Lead T                 T
Long A                 A
Long O                 O
Long E                 E
Auh                    @

Scoring Descriptors    Symbol

Located                L
Identified             I
Missed                 O
Not Evaluated          -


Author  1 


Author  2 


Combined 


Not  scored  due  to  irreversible  data  preprocessing  malfunction 
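
The scoring descriptors can be tallied mechanically. The sketch below assumes that an identified phoneme also counts as located and that not-evaluated positions are excluded from the totals; the function name and the sample score strings are hypothetical.

```python
def tally_scores(score_strings):
    """Return (located rate, identified rate, events) from L/I/O score strings."""
    counts = {"I": 0, "L": 0, "O": 0}
    for scores in score_strings:
        for symbol in scores:
            if symbol in counts:           # dashes (not evaluated) are skipped
                counts[symbol] += 1
    events = sum(counts.values())
    located = (counts["I"] + counts["L"]) / events
    identified = counts["I"] / events
    return located, identified, events

# Hypothetical "All Phonemes" entries for four words.
located, identified, events = tally_scores(["LI", "I-I-I", "II-L", "I-O"])
print(events, located, identified)
```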


Table  XX 

B-Word  Group  Analysis 

Author  1 

Author 

_2 

Phonemic 

B All 

B 

All 

Word 

Rendition 

Phoneme  Phonemes 

Phoneme 

Phonemes 

Bay 

BA 

*_  *i 

I- 

II 

Babble 

B-B- 

H 

1 

H 

1 

M 

1 

H 

1 

I-I- 

I-I- 

Batter 

3-T-R 

I I-I-L 

X 

I-L-I 

Be 

BE 

0-  01 

0- 

01 

Bench 

B 

x x 

I — 

I 

Bitter 

B-T-R 

I I-I-I 

I — 

I-O-L 

Bite 

B-T 

I—  1-0 

I— 

I-I 

Boat 

BOT 

L—  LIO 

I— 

III 

Bought 

B@- 

I—  II- 

L — 

LI- 

Buy 

B- 

I-  I- 

I- 

I- 

Butter 

B@T-R 

I ILI-L 

I 

IIO-L 

Ble  nd 

B 

L L 

I 

I 

Bright 

B — T 

I—  1—0 

I 

I — I 

Bulb 

B— B 

L— I L — I 

Overall  Performance 

B-Phoneme  Score 

I--I  I--I 

All  Phoneme  Score 

Speaker(s) 

Located  Identified 

Located 

Identified 

Table XXI

D-Word Group Analysis

                          Author 1                Author 2
           Phonemic    D         All           D         All
Word       Rendition   Phoneme   Phonemes      Phoneme   Phonemes

Day        DA          I-        II            L-        LI
Debt       D--         I--       I--           I--       I--
Debit      D-B-T       I----     I-I-O         I----     I-I-I
Ditto      D-TO        I---      I-OI          I---      I-II
Donut      DO-@-       I----     II-I-         I----     IL-I-
Dug        D@-         I--       II-           I--       II-
Dust       D@-T        I---      II-I          I---      II-I
Drafted    D--T-D      I----I    I--I-I        I----I    I--I-I
Danger     DA--R       I----     II--I         I----     II--L
Dagger     D--R        I---      I--I          I---      I--L
Dread      DR-D        I--I      II-I          I--I      II-I
Dead       D-D         I-O       I-O           I-I       I-I
Dodge      D@-         I--       II-           I--       II-
Dude       D-D         I-I       I-I           L-L       L-L

Overall Performance

              D-Phoneme Score                  All Phoneme Score
Speaker(s)    Located         Identified       Located         Identified

Author 1      17/18 = 94.4%   17/18 = 94.4%    31/34 = 91.2%   31/34 = 91.2%
Author 2      18/18 = 100%    15/18 = 83.3%    34/34 = 100%    28/34 = 82.4%
Combined      35/36 = 97.2%   32/36 = 88.9%    65/68 = 95.6%   59/68 = 86.8%

Table  XXII 


R-Word  Group  Analysis 


Author 

1 

Author 

_2 

Phonemic 

R 

All 

R 

All 

Word 

Rendition 

Phoneme 

Phonemes 

Phoneme 

Phonemes 

Rat 

R-T 

I — 

I-I 

I — 

I-I 

Read 

RED 

I— 

III 

I— 

III 

Ride 

R-D 

I — 

I-I 

I — 

I-I 

Robe 

ROB 

I— 

III 

I— 

III 

Rut 

R@T 

I — 

no 

I— 

III 

Rhino 

R— 0 

I 

I— I 

I 

I— L 

Rather 

R — R 

I— L 

I— L 

I — I 

I— I 

Rear 

R-R 

I-L 

I-L 

I-I 

I-I 

Right 

R-T 

I— 

I-L 

I — 

I-I 

Resist 

RE — T 

* 

II — I 

X 

II— I 

Rand 

R — D 

L 

L—  L 

I 

I— I 

Rover 

RO — R 

I 1 

II  — I 

I 1 

II— I 

Rare 

R-R 

I-I 

I-I 

I-I 

I-I 

Rubber 

R@B-R 

I 1 

IIL-I 

I 1 

III-I 

Overall  Performance 

R-Phoneme  Score 

All  Phoneme  Score 

Speaker (s) 

Located 

Identified 

Located 

Identified 

Author  1 

if-84.2% 

§■97. 1% 

1—80* 

Author  2 

if-1004 

~=ioo% 

§-100. 

§.97.1* 

Combined 

f.100% 

lh2-14 

f-98.6% 

§-88.6* 



Table  XXIII 


T-Word Group Analysis

Word 

Taker 

Phonemic 

Rendition 

Author 

T 

Phoneme 

_1 

All 

Phonemes 

Author 

T 

Phoneme 

2 

All 

Phonemes 

1 

Terminate 

TR—  A- 

II— I- 

Tide 

T-D 

I-- 

I-L 

I — 

I-I 

Tight 

T-T 

I-I 

I-I 

I-I 

I-I 

Toad 

TO- 

I — 

II- 

L— 

LI- 

Tore 

T — 

I— 

I-- 

L — 

L — 

Tub 

T@B 

I— 

III 

I— 

III 

Tube 

T-B 

I— 

I-L 

I — 

I-L 

Through 

* 

Tither 

* 

Tribe 

T — B 

I 

I— I 

I 

I — I 

Tip 

T — 

I— 

I— 

I — 

I — 

Twist 

T — T 

I— I 

I— I 

I — I 

I— I 

Trade 

TRAD 

I 

ILII 

I 

ILII 

Overall  Performance 

T-Phoneme  Score 

All  Phoneme  Score 

Speaker (s) 

Located 

Identified 

Located 

Identified 

Author  1 

14 

ir100% 

14 

ir100% 

It96*3% 

§f85.1% 

Author  2 

14 

ir100% 

IT78'5' 

If  100% 

§f  17.7% 

Combined 

28 

Is" 1003 

§§-89.3, 

It98' 

81. 5% 

*Not scored due to irreversible data preprocessing malfunction.
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Table  XXV 


AUH  (@)-Word  Group  Analysis 


Author 

1 

Author 

2 

Phonemic 

AUH 

All 

AUH 

All 

Word 

Rendition 

Phoneme 

Phonemes 

Phoneme 

Phonemes 

Among 

<3 

I 

I 

I 

I 

About 

@B-T 

I 

II-O 

I 

II-I 

American 



Topeka 

TO-E-@ 

* 1 

1 

LL-I-I 

Santa 

T@ 

j 

— II 

x 

II 

Mascara 

“-“\9 

1 

Another 

@ R 

! 

I L 

X 

I 1 

Caruso 

-Sr— o 



-II— I 

-X 

-II— I 

Appear 

@ — R 

L 

L— L 

I — 

I— I 

Attempt 

@T 

L 

LI 

L— - 

LI 

Accumulate 

(g— — — — — Ai 

S-O-E- 

Li  — — — — ljU 

L-I-I- 

Associate 

Approximate 

@ 

I — I 

I — I — 

I— x— 

X — X — 

Against 

@ T 

L 

L 1 

I 

I 1 

Overall  Performance 

AUH-Phoneme  Score 

All  Phoneme  Score 

Speaker(s) 

Located 

Identified 

Located 

Identified 

Author  1 

Af-loo* 

^<=66.6% 

±-§=93  3% 

30 

f.66.6% 

Author  2 

TTl00% 

B*eo% 

^SO.6% 

Combined 

If100' 

f-73-3' 

59 

ir96-7% 

6 r73*7% 

*Not processed due to preprocessing error.
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Table  XXVI 

E-Word Group Analysis

                        Author 1                Author 2
           Phonemic     E          All          E          All
Word       Rendition    Phoneme    Phonemes     Phoneme    Phonemes

Leave      -E-          -I-        -I-          -I-        -I-
Each       E-           I-         I-           I-         I-
Me         -E           -I         -I           -I         -I
See        -E           -I         -I           -I         -I
Even       E--          I--        I--          I--        I--
Leach      -E-          -I-        -I-          -I-        -I-
Beat       BET          -I-        OII          -I-        OII
Meet       -ET          -I-        -II          -I-        -II
Sleep      --E-         --I-       --I-         --I-       --I-
Valley     E            L          L            I          I
Reek       RE-          -I-        II-          -I-        II-
Key        -E           -I         -I           -I         -I
Egress     E-R-         I---       I-I-         I---       I-I-
Ego        E-O          I--        I-I          I--        I-I

Overall Performance

               E-Phoneme Score                  All Phoneme Score
Speaker(s)     Located        Identified        Located        Identified

Author 1       14/14 = 100%   13/14 = 92.8%     19/20 = 95%    18/20 = 90%
Author 2       14/14 = 100%   14/14 = 100%      19/20 = 95%    19/20 = 95%
Combined       28/28 = 100%   27/28 = 96.4%     38/40 = 95%    37/40 = 92.5%


Table XXVII


O-Word Group Analysis

                        Author 1                Author 2
           Phonemic     O          All          O          All
Word       Rendition    Phoneme    Phonemes     Phoneme    Phonemes

Go         -O           -I         -I           -I         -I
So         -O           -I         -I           -I         -I
Blow       B-O          --I        L-I          --I        I-I
Obey       OBA          I--        IIL          I--        III
Omit       O--          I--        I--          I--        I--
Over       O-R          I--        I-I          I--        I-I
Note       -OT          -I-        -II          -I-        -II
Those      -O-          -I-        -I-          -I-        -I-
Pose       -O-          -I-        -I-          -I-        -I-
Rose       RO-          -I-        II-          -I-        II-
Nose       -O-          -I-        -I-          -I-        -I-
Most       -O-T         -I--       -I-I         -I--       -I-I
Both       BO-          -I-        LI-          -I-        II-
No         -O           -I         -I           -I         -I

Overall Performance

               O-Phoneme Score                  All Phoneme Score
Speaker(s)     Located        Identified        Located        Identified

Author 1       14/14 = 100%   14/14 = 100%      22/22 = 100%   19/22 = 86%
Author 2       14/14 = 100%   14/14 = 100%      22/22 = 100%   22/22 = 100%
Combined       28/28 = 100%   28/28 = 100%      44/44 = 100%   41/44 = 93.1%
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Table  XXVIII 


Verification Sentence Analysis

"Abraham drafted a note."

                        Author 1                  Author 2
           Phonemic
Word       Rendition    Discrete    Continuous    Discrete    Continuous

Abraham    ABR@--       ILIL--      ILIL--        ILIL--      IOIL--
Drafted    D--T-D       I--I-I      I--O-I        I--I-I      I--L-I
A          @ or A       I           L             I           I
Note       -OT          -IO         -IO           -II         -II

Overall Performance

               Discrete Score                   Continuous Score
Speaker(s)     Located        Identified        Located        Identified

Author 1       9/10 = 90%     7/10 = 70%        8/10 = 80%     5/10 = 50%
Author 2       10/10 = 100%   8/10 = 80%        9/10 = 90%     7/10 = 70%
Combined       19/20 = 95%    15/20 = 75%       17/20 = 85%    12/20 = 60%
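The overall scores in these tables can be reproduced from the per-word result strings. The sketch below assumes a legend that this section does not state explicitly but that the tallies in Table XXVIII are consistent with: "I" marks a phoneme that was located and identified, "L" one that was located but not identified, "O" one that was not located, and "-" an unscored position. The function name `score` and the list layout are illustrative only.

```python
def score(codes):
    """Tally (located, identified, total) from per-word result strings.

    Assumed legend (inferred from the table totals, not stated in the text):
      I = located and identified, L = located only, O = not located,
      - = position not scored.
    """
    symbols = [c for word in codes for c in word if c in "ILO"]
    total = len(symbols)
    located = sum(c in "IL" for c in symbols)
    identified = symbols.count("I")
    return located, identified, total


# Author 1 rows of Table XXVIII ("Abraham drafted a note.")
author1_discrete = ["ILIL--", "I--I-I", "I", "-IO"]
author1_continuous = ["ILIL--", "I--O-I", "L", "-IO"]

for name, rows in [("discrete", author1_discrete),
                   ("continuous", author1_continuous)]:
    located, identified, total = score(rows)
    print(f"{name}: located {located}/{total}, identified {identified}/{total}")
```

Under this legend the discrete rows give 9/10 located and 7/10 identified, and the continuous rows give 8/10 and 5/10, which agrees with the Author 1 line of the table.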


Table XXIX


Verification Sentence Analysis

"See me wave at my associate."

                        Author 1                  Author 2
           Phonemic
Word       Rendition    Discrete    Continuous    Discrete    Continuous

See        -E           -I          -I            -I          -I
Me         -E           -I          -I            -I          -I
Wave       -A-          -I-         -I-           -I-         -I-
At         -T           -O          -O            -I          -L
My
Associate  @-O-E-       L-I-I-      I-I-I-        I-I-L-      L-I-I-

Overall Performance

               Discrete Score                   Continuous Score
Speaker(s)     Located        Identified        Located        Identified

Author 1       6/7 = 85.7%    5/7 = 71.4%       6/7 = 85.7%    6/7 = 85.7%
Author 2       7/7 = 100%     6/7 = 85.7%       7/7 = 100%     5/7 = 71.4%
Combined       13/14 = 92.8%  11/14 = 78.5%     13/14 = 92.8%  11/14 = 78.5%


Table XXX


Verification Sentence Analysis

"A boy got out the back gate."

                        Author 1                  Author 2
           Phonemic
Word       Rendition    Discrete    Continuous    Discrete    Continuous

A          @ or A       I           I             I           I
Boy        B-           L-          I-            I-          I-
Got        -@T          -IO         -IO           -II         -IL
Out        -T           -O          -O            -I          -O
The        -E or -@     -I          -I            -I          -L
Back       B--          I--         I--           I--         I--
Gate       -AT          -IO         -IL           -II         -II

Overall Performance

               Discrete Score                   Continuous Score
Speaker(s)     Located        Identified        Located        Identified

Author 1       6/9 = 66.6%    5/9 = 55.5%       7/9 = 77.7%    6/9 = 66.6%
Author 2       9/9 = 100%     9/9 = 100%        8/9 = 88.8%    6/9 = 66.6%
Combined       15/18 = 83.3%  14/18 = 77.7%     15/18 = 83.3%  12/18 = 66.6%


Table  XXXI 

Verification  Sentence  Analysis 

"Joe was seen around the airplane."

Author  1 

Author 

2 

Phonemic 

Word 

Rendition  Discrete  Continuous 

Discrete 

Continuous 

Joe 

-0 

-I 

-I 

-I 

-I 

Was 

-I- 

-L- 

-I- 

-I- 

Seen 

-E- 

-I- 

-L- 

-I- 

-I- 

Around 

@R — D 

ii— r 

LO — I 

II — 0 

II— L 

The 

-E  or  -@ 

-i 

-I 

-I 

-I 

Airplane 

Overall 

Performance 

Discrete 

Score 

Continuous  Score 

Speaker (s) 

Located 

Identified 

Located 

Identified 

Author  1 

-f-100% 

T87'5' 

-r87-84 

nr37-54 

Author  2 

-r87- 54 

~T75% 

-f-100% 

-f-8,.5% 

Combined 

IT93-74 

IT81-24 

IT93-74 

IT62'54 
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Table  XXXII 


Test Sentence Analysis

"Abraham drafted a note."

           Phonemic     Speaker 1*
Word       Rendition    Discrete    Continuous

Abraham    ABR@--       OILI--      OIOI--
Drafted    D--T-D       I--L-I      O--L-I
A          @ or A       I           I
Note       -OT          -IO         -IO

Overall Performance

               Discrete Score                   Continuous Score
Speaker(s)     Located        Identified        Located        Identified

Speaker 1      8/10 = 80%     6/10 = 60%        6/10 = 60%     5/10 = 50%

*Speaker 2 and Speaker 3 not scored due to irreversible data
preprocessing malfunction.
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Table  XXXIII 


(continued) 


Table XXXIII -- continued


*Not scored due to irreversible data preprocessing malfunction.


Table  XXXIV 


Test Sentence Analysis

"The bright bulb formed a ray that made a trace of the rubber rat."

                        Speaker 1*                Speaker 3
           Phonemic
Word       Rendition    Discrete    Continuous    Discrete    Continuous

The        -A or -@     -I          -L            -I          -I
Bright     B--T         I--L        L--O          I--O        I--O
Bulb       B--B         I--I        I--I          I--I        I--I
Formed     D            I           O             I           O
A          @ or A       I           I             I           I
Ray        RA           LO          OO            IO          IO
That       --           --          --            --          --
Made       -AD          -OI         -OO           -OI         -OI
A          @ or A       I           I             I           L
Trace      T-A-         I-I-        I-O-          I-L-        I-L-
Of         @-           I-          I-            L-          I-
The        -E or -@     -I          -I            -I          -I
Rubber     R@B-R        LII-L       OII-O         LII-I       OII-L
Rat        R-T          O-O         O-O           L-L         I-O

Overall Performance

               Discrete Score                   Continuous Score
Speaker(s)     Located        Identified        Located        Identified

Speaker 1      18/22 = 81.8%  14/22 = 63.6%     11/22 = 50%    9/22 = 40.9%
Speaker 3      19/22 = 86.4%  14/22 = 63.6%     16/22 = 72.7%  13/22 = 59.1%
Combined       37/44 = 84.1%  28/44 = 63.6%     27/44 = 61.4%  22/44 = 50%

*Speaker 2 not scored due to irreversible data preprocessing
malfunction.


Table  XXXV 


Test  Sentence  Analysis 

"From  the  boat  docked  in  the  bay,  we  saw  the  rhino, 
leech,  and  toad  as  they  lay  dead  along  the  tide." 


Word 

Phonemic 

Rendition 

Speaker  1 Speaker  2* 

Discrete  Continuous  Discrete  Continuous 

From 

— 0-  * 

* __I_ 

— I- 

The 

-A  or 

-§ 

-I 

-I 

"I 

Boat 

BOT 

IIO 

III 

ILO 

Docked 

In 

The 

D@-D 

II-I 

IL-I 

II-I 

-A  or 

-<a 

-I 

-I 

-L 

Bay 

BA 

10 

10 

10 

We 

-E 

-L 

-0 

-°  | 

Saw 

-<a 

-I 

-I 

-I 

The 

-A  or 

-<a 

-I 

-I 

-I 

Rhino 

R — 0 

I--I 

0— I 

0— I 

Leech 

-E- 

-L- 

-0- 

-0- 

And 

— D 

—I 

“I 

— 0 1 

Toad 

As 

They 

TO- 

II- 

II- 

II- 

-A 

-0 

-0 

-0 

Lay 

-A 

-0 

-0 

-0 

Dead 

D-D 

0-1 

0-L 

0-1 

Along 

<a~ 

I-- 

I — 

I-- 

(continued) 
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Table  XXXV — continued 

Test  Sentence 

Analysis 

"From the boat docked in the bay, we saw the rhino,
leech, and toad as they lay dead along the tide."

Word 

Phonemic  Speaker  1 

Rendition  Discrete  Continuous 

Speaker  2* 

Discrete  Continuous 

The 

-A  or  -I 

* 

-I 

-I 

Tide 

T-D  0-1 

* 

I-I 

L-I 

Overall  Performance 

Discrete  Score 

Continuous  Score 

Speaker (s) 

Located 

Identified 

Located 

Identified 

Speaker  1 

■^r75% 

28 

^1=67.9% 

* 

* 

Speaker  2 

f|«=75% 

28 

if-67.9. 

if-67.9. 

Combined 

1^=75% 

56 

f-67.8. 

if-67.9* 

if-57.1* 

*Speaker 3's test sentence and Speaker 1's continuous speech
were not processed due to an irreversible preprocessing malfunction.
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rable  XXXVI 

Test  Sentence  Analysis 

"Before the trip, the rabbit rested along the open field of the
rancher."

Word 

Phonemic 

Rendition 

Speaker  1 

Discrete  Continuous 

Speaker  2* 

Discrete  Continuous 

Before 

BE-R 

II-I 

II-I 

II-I 

IL-0 

The 

-E  or  -@ 

-I 

-I 

-I 

-I 

Trip 

T 

I 

I 

I 

I 

The 

-E  or  -@ 

-I 

-I 

-I 

-I 

Rabbit 

R-B-T 

L-I-0 

I-I-O 

L-I-L 

I-I-I 

Rested 

R-TD 

I -01 

I -00 

O-II 

I -01 

Along 

<§-- 

I— 

I— 

I— 

I-- 

The 

-E  or  -@ 

-I 

-I 

-I 

-I 

Open 

0— 

I~ 

I— 

I — 

L— 

Field 

-E-D 

-I-I 

-O-L 

-0-1 

-0-1 

Of 

<a- 

I- 

I- 

I- 

I- 

The 

-E  or  -@ 

-I 

-I 

-I 

-I 

Rancher 

R R 

L L 

I 1 

I 1 

L L 

Overall  Performance 

Speaker (s) 

Discrete  Score 
Located  Identified 

Continuous  Score 
Located  Identified 

Speaker  1 

g-9°.5% 

M-76-2' 

±2-87. 

if-76.3. 

Speaker  2 

if-90.5. 

if-85.7. 

^66.7% 

Combined 

JT90-5* 

H-78.6. 

H-83.3. 

It71-4' 

*Speaker 3 not scored due to irreversible data preprocessing
malfunction.

Table  XXXVII 


Test  Sentence  Analysis 


"Does  Dennis  teach  reading  or  does  Dennis  teach  driving?" 


Phonemic 

Speaker 

2* 

Speaker 

3 

Word 

Rendition 

Discrete 

Continuous 

Discrete 

Continuous 

Does 

D@- 

II- 

* 

II- 

LI- 

Dennis 

D 

0 

* 

I 

L 

Teach 

TE- 

IL- 

* 

IL- 

LO- 

Reading 

RED- 

no- 

♦ 

III- 

ILO- 

Or 

-R 

-L 

* 

-I 

-I 

Does 

D@- 

LI- 

* 

II- 

LL- 

Li— 

Teach 

TE- 

II- 

* 

II- 

LI- 

n 

* 

Overall  Performance 

Speaker (s) 

Discrete  Score 
Located  Identified 

Continuous  Score 
Located  Identified 

Speaker  2 

IT80% 

9 

—2=60% 

15 

* 

* 

Speaker  3 

IT93-3* 

i|=86.6% 

Yg=86 .6% 

4 

•j-g=26 . 6% 

Combined 

ir86-64 

22 

3T73‘3% 

y|=86.6% 

4 

Yg*=26 .6% 

*Speaker 1's test sentence and Speaker 2's continuous speech
were not processed due to an irreversible preprocessing
malfunction.
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Table  XXXVIII 
Test  Sentence  Analysis 


"Joe was seen around the airplane."

Word 

Phonemic 

Rendition 

Speaker  2* 

Discrete  Continuous 

Joe 

-0 

-I 

★ 

Was 

-<§- 

-I- 

* 

Seen 

-E- 

-L- 

* 

Around 

@R — D 

II— I 

★ 

The 

-E  or 

-I 

* 

Airplane 

-R 

-L 

* 

Overall  Performance 

Discrete  Score  Continuous  Score 

Speaker (s)  Located  Identified  Located  Identified 

Speaker  2 — |*=100%  — |*=75%  * * 

O O 

*Speaker 1's and Speaker 3's test sentences and Speaker 2's
continuous speech not scored due to irreversible
preprocessing malfunction.
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Table  XXXIX 


(continued) 


Table XXXIX -- continued


Table  XL 


Test  Sentence  Analysis 


"Each  person  at  Beckman  sees  his  responsibility  aimed  toward 
fabricating  better  resistors,  displays  and  drugs." 


Phonemic 

Speaker 

_2* 

Speaker 

_3 

Word 

Rendition 

Discrete  Continuous 

Discrete  i 

Continuous 

Each 

E- 

L- 

0- 

I- 

0- 

Person 

— 

— 

— 

— 

— 

At 

-T 

-I 

-0 

-0 

-0 

OCC'iUOcUl 

1 

Sees 

-E- 

-0- 

-0- 

-0- 

-0- 

His 

— 

— 

— 

— 

— 

Aimed 

A-D 

0-1 

0-1 

0-1 

0-1 

Toward 

T D 

I — I 

I — I 

I — I 

L L 

Fabricating 

— B-§-AT- 

— I-I-OO- 

—I-I-OO- 

—I-I-OO- 

— I-I-OO- 

Better 

B-TR 

I-OI 

I-OI 

1-01 

I-IO 

Resistors 

RE-T — 

LO-I — 

10-0  — 

II-I— 

OL-I — 

Displays 

D A- 

L 0- 

L L- 

I L- 

0 0- 

And 

— D 

— I 

—I 

— I 

— L 

Drugs 

D-@- 

0-1- 

L-I- 

L-I- 

I-I- 

(continued) 


Table  XL — continued 


Test  Sentence  Analysis 

"Each  person  at  Beckman  sees  his  responsibility  aimed  toward 
fabricating  better  resistors,  displays  and  drugs." 


Overall  Performance 

Discrete  Score  Continuous  Score 


Speaker (s) 

Located 

Identified 

Located 

Identified 

Speaker  2 

^=52% 

25 

fr- 

12 

— =48% 

25 

Speaker  3 

H-™ 

^68% 

25 

14  „ 

— 56% 

25 

i*404 

Combined 

30 

5T60% 

If44' 

*Speaker 1 not scored due to irreversible data preprocessing
malfunction.


D.  Correlation  Dependency  on  Prototype  Phoneme  Length 

This appendix contains a mathematical analysis of how a
correlation's magnitude depends on phoneme length when a column
normalized prototype phoneme is correlated with a column plus unit
normalized prototype phoneme. The analysis demonstrates that the
maximum correlation of these two "processed" arrays is the square
root of the length of the particular prototype phoneme. In addition,
the analysis shows that the maximum correlation magnitude can be
limited to unity for phonemes of different lengths.


Definition of Terms:

1.  Let a prototype phoneme consist of a matrix

    \[ P = (P_j), \qquad j = 1, \ldots, J \]

    where each column is the vector

    \[ P_j = (p_{ij}) \]

2.  Energy of P:

    \[ E(P) = \sum_j \sum_i (p_{ij})^2 \]

3.  Norm of a column:

    \[ \|P_j\| = \Big[ \sum_i (p_{ij})^2 \Big]^{1/2} \]

4.  Energy of P_j:

    \[ E(P_j) = \sum_i (p_{ij})^2 \]

5.  Unit vector:

    \[ \hat{P}_j = (\hat{p}_{ij}) \]

    where

    \[ \hat{p}_{ij} = p_{ij} \Big/ \Big[ \sum_i (p_{ij})^2 \Big]^{1/2} \]


The Column Normalization of P is:

    \[ \hat{P}_j = P_j / \|P_j\| \]

The Column plus Unit Normalization of P is:

    \[ \tilde{P} = (K_j) \]

where

    \[ K_j = \hat{P}_j / \|\hat{P}\| \]

and where

    \[ \|\hat{P}\| = \Big[ \sum_j \|\hat{P}_j\|^2 \Big]^{1/2} \]

Since \(\|\hat{P}_j\| = 1\), \(\|\hat{P}_j\|^2 = 1\), so

    \[ \sum_j \|\hat{P}_j\|^2 = J \]

Therefore,

    \[ \|\hat{P}\| = \sqrt{J} \]

and

    \[ \tilde{P} = \frac{1}{\sqrt{J}} \, \hat{P} \]

Assuming that somewhere in the sentence sample there is an
exact replica of the prototype phoneme P, this means that the
correlation computation will be performed between \(\tilde{P}\)
and \(\hat{P}\).

Correlation implies:

    \[ \{\tilde{P} \cdot \hat{P}\} = \sum_j (K_j \cdot \hat{P}_j) \]

Maximum correlation occurs when two identical elements are
correlated.

Maximum Correlation:

    \[ \{\tilde{P} \cdot \hat{P}\}
         = \sum_j \big( \hat{P}_j / \sqrt{J} \big) \cdot \hat{P}_j
         = \sum_j \|\hat{P}_j\|^2 / \sqrt{J}
         = \sum_j 1.0 / \sqrt{J}
         = J / \sqrt{J}
         = \sqrt{J} \]

which is the square root of the length of the prototype phoneme.

In order to ensure that the correlation amplitudes are
limited to a maximum value of unity, the correlation values
for each phoneme must be divided by the square root of the
length of the prototype phoneme.
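The result above can be checked numerically. The following is a minimal sketch, not the thesis's correlation program: it assumes a prototype phoneme stored as a NumPy array with one column per time frame (16 rows for the 16 frequency channels), and the function names are illustrative.

```python
import numpy as np


def column_normalize(P):
    """Scale every column of P to unit Euclidean length."""
    return P / np.linalg.norm(P, axis=0, keepdims=True)


def column_plus_unit_normalize(P):
    """Column normalize, then scale the whole array to unit norm.

    After column normalization the array norm equals sqrt(J),
    where J is the number of columns.
    """
    P_hat = column_normalize(P)
    return P_hat / np.linalg.norm(P_hat)


def max_correlation(P):
    """Correlate the two processed versions of the same phoneme."""
    return float(np.sum(column_plus_unit_normalize(P) * column_normalize(P)))


rng = np.random.default_rng(0)
for J in (4, 9, 25):
    # Hypothetical prototype: 16 channels by J frames, strictly positive.
    P = rng.random((16, J)) + 0.1
    peak = max_correlation(P)
    print(J, peak, peak / np.sqrt(J))
```

For each length J the peak equals the square root of J, and dividing by that factor brings it to unity, which is the length compensation described above.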
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APPENDIX  E 


SPECTROGRAM  OVERPRINT  SCHEME 


E.  Spectrogram Overprint Scheme


This appendix explains the spectrogram overprint scheme used in
this research. The two programs, OCTAVE1 and OCTAVE2, produced
spectrograms of the 16-component frequency vectors according to the
following procedure. Each component was compared against a threshold
to select the proper number of overprints; a round-up procedure was
used to form integer values, and these integer values correspond to
the overprint levels of darkness shown in Table XLI.
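The threshold and round-up step can be sketched as follows. This is only an illustration of the general idea: the threshold, step size, and maximum level used by OCTAVE1 and OCTAVE2 are not given in this section, so `threshold`, `step`, and `max_level` are hypothetical parameters.

```python
import math


def overprint_level(value, threshold, step, max_level):
    """Map one frequency component to an integer overprint count.

    All three parameters are assumptions for illustration; the actual
    values used by the thesis programs are not stated here.
    """
    if value <= threshold:
        return 0  # below threshold: leave the cell blank
    level = math.ceil((value - threshold) / step)  # round up to an integer
    return min(level, max_level)  # cap at the darkest available level


# Hypothetical frame of component values
frame = [0.5, 1.2, 2.0, 10.0]
print([overprint_level(v, threshold=1.0, step=0.5, max_level=4) for v in frame])
```

Each integer returned would then select one darkness level from a table such as Table XLI.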


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Spectrogram  Overprint  Scheme 


Glossary  of  Technical  Terms 


1.  Aliasing:  The term "aliasing" refers to the fact that
high-frequency components of a time function can impersonate
low frequencies if the sampling rate is too low.

2.  Allophone:  The variant forms of a phoneme as conditioned
by position or adjoining sounds.

3.  Autocorrelation:  The discrete convolution of the function
x(n) with x(-n). Compute X(k), the DFT of x(n), and
multiply by X*(k). The inverse DFT of X(k)X*(k) = |X(k)|^2
corresponds to the circular convolution of x(n) with x(-n),
i.e., a circular correlation.

4.  Crosscorrelation:  The discrete convolution of the
function x(n) with the function y(-n). Note the above, and
that the DFT of y(-n) is Y*(k).

5.  Diphthong:  A combination of two vowels in the same
syllable, in which the speaker glides continuously from
one vowel to another.

6.  Discrete Fourier Transform (DFT):  The Discrete Fourier
Transform (DFT) is defined as

    \[ F(k) = \sum_{n=0}^{N-1} f(nT) \, e^{-j(2\pi/N)nk} \]

where f(nT) corresponds to equally spaced samples of an
analog time function f(t). Assuming that the sampling
has been done at a rate equal to or higher than the
Nyquist rate (2f_m, where f_m is the highest frequency in
the analog time function), then the magnitude of the
kth spectral point |F(k)| corresponds to the magnitude
that would be obtained at a time t = (N-1)T if the sample
of the analog function f(t) were processed by an analog
filter with a frequency response H(w) given by:


    \[ |H(w)| = \left| \frac{\sin[NT(w - w_k)/2]}{\sin[T(w - w_k)/2]} \right|,
       \qquad w_k = \frac{2\pi k}{NT} \]


7.  End-Effect:  The effect on computational results caused
by the periodicity imposed on a function by use of the


8.  Fricatives:  Sounds produced by partial constriction
along the vocal tract which results in turbulence. The
sounds can be further subdivided into voiced and unvoiced
categories. The voiceless fricatives are produced as a
result of frictional modulation. The voiced fricatives
combine frictional with vocal cord and cavity modulation.

9.  Leakage:  The term "leakage" refers to the discrepancy
between the continuous and discrete Fourier transforms
caused by the required time domain truncation.

10.  Morpheme:  Any of the minimum meaningful elements in a
language, not further divisible into smaller meaningful
elements, usually recurring in various contexts with
relatively constant meaning, such as a word.

11.  Nasals:  Sounds that are produced by allowing the air to
flow through the nasal cavities. Coupling the nasal
cavities to the resonance system of the vocal tract results
in nasalized vowels. If the air flow is restricted
to flowing only through the nasal cavities, nasal consonants
are produced.

12.  Phone:  An individual speech sound.

13.  Phoneme:  The smallest distinctive group or class of phones
in a language. In a very general sense, the phonemes that
make up a speech sound can be compared to the letters that
make up a written word.

14.  Pitch:  The pitch of a sound with a periodic wave form,
i.e., a voiced sound, is determined by its fundamental
frequency, or rate of repetition of the cycles of air
pressure.

15.  Plosives:  Sounds that are produced by a sudden release
of built-up air pressure. The sounds can be further
distinguished by the presence or absence of voicing. A
voiceless stop occurs when the stop is combined with
fricative modulation. A voiced stop occurs when vocal
cord modulation is combined with stop and fricative
modulation.

16.  Template:  The phoneme employed for matching in the
correlation program.

17.  Vowels:  Sounds whose source of excitation is the glottis.
During vowel production, the vocal tract is relatively
open and the air flows over the center of the tongue,
causing a minimum of turbulence. The phonetic value of
the vowel is determined by the resonances of the vocal
tract, which are in turn determined by the shape and
position of the tongue and lips.
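The DFT route to correlation described in the Autocorrelation and Crosscorrelation entries can be illustrated with a short sketch. It assumes real-valued sequences of equal length and uses NumPy's FFT routines; the function names are illustrative and this is not the thesis's crosscorrelation program.

```python
import numpy as np


def circular_crosscorrelation(x, y):
    """Inverse DFT of X(k) Y*(k): circular correlation of real x with real y."""
    X = np.fft.fft(x)
    Y = np.fft.fft(y)
    return np.real(np.fft.ifft(X * np.conj(Y)))


def circular_crosscorrelation_direct(x, y):
    """Same quantity by direct summation: r[m] = sum over n of x[n + m] * y[n]."""
    N = len(x)
    return np.array([sum(x[(n + m) % N] * y[n] for n in range(N))
                     for m in range(N)])


x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 1.0, 0.0, 0.0])

print(circular_crosscorrelation(x, y))         # DFT route
print(circular_crosscorrelation_direct(x, y))  # direct summation
print(circular_crosscorrelation(x, x)[0])      # autocorrelation at lag 0 = energy of x
```

Because the DFT treats each sequence as periodic, the result wraps around at the ends, which is the end-effect named in the glossary; padding the sequences with zeros before transforming is the usual way to avoid it.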
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